EXPLOITING SMOOTHNESS IN STATISTICAL LEARNING, SEQUENTIAL PREDICTION, AND STOCHASTIC OPTIMIZATION

By

Mehrdad Mahdavi

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

Computer Science and Engineering - Doctor of Philosophy

2014

ABSTRACT

EXPLOITING SMOOTHNESS IN STATISTICAL LEARNING, SEQUENTIAL PREDICTION, AND STOCHASTIC OPTIMIZATION

By Mehrdad Mahdavi

In the last several years, the intimate connection between convex optimization and learning problems, in both the statistical and sequential frameworks, has shifted the focus of algorithmic machine learning to examining this interplay. In particular, on one hand, this intertwinement brings forward new challenges in reassessing the performance of learning algorithms under the assumptions imposed by convexity, such as Lipschitzness, strong convexity, and smoothness. On the other hand, the emergence of datasets of an unprecedented size demands the development of efficient optimization algorithms to tackle large-scale learning problems.

The overarching goal of this thesis is to reassess the smoothness of loss functions in statistical learning, sequential prediction/online learning, and stochastic optimization and to explicate its consequences. In particular, we examine how leveraging the smoothness of the loss function can be beneficial or detrimental in these settings in terms of sample complexity, statistical consistency, regret analysis, and convergence rate.

In the statistical learning framework, we investigate the sample complexity of learning problems when the loss function is smooth and strongly convex and the learner is provided with the target risk as prior knowledge. We establish that under these assumptions, by exploiting the smoothness of the loss function, we are able to improve the sample complexity of learning exponentially. We also investigate smoothness from the viewpoint of statistical consistency and show that, in sharp contrast to optimization and generalization, where smoothness is favorable because of its computational and theoretical virtues, the smoothness of the surrogate loss function might deteriorate the binary excess risk. Motivated by this negative result, we provide a unified analysis of three types of errors, including the optimization error, the generalization bound, and the error in translating the convex excess risk into a binary excess risk, and underline the conditions under which smoothness might be preferred.

We then turn to elaborating the importance of smoothness in sequential prediction/online learning. We introduce a new measure to assess the performance of online learning algorithms, referred to as the gradual variation. The gradual variation is measured by the sum of the distances between every two consecutive loss functions and is more suitable for gradually evolving environments such as stock prediction. Under the smoothness assumption, we devise novel algorithms for online convex optimization with regret bounded by the gradual variation.

Finally, we investigate how to exploit the smoothness of the loss function in convex optimization. We propose a novel optimization paradigm, referred to as mixed optimization, which interpolates between stochastic and full gradient methods and is able to exploit the smoothness of loss functions to obtain faster convergence rates in stochastic optimization, and condition number independent accesses of full gradients in deterministic optimization.
We also propose efficient projection-free optimization algorithms to tackle the computational challenge arising from the projection steps that are required at each iteration of most existing gradient based optimization methods to ensure the feasibility of intermediate solutions. In the stochastic optimization setting, by introducing and leveraging smoothness, we develop novel methods which only require one projection at the final iteration. In the online learning setting, we consider online convex optimization with soft constraints, where the constraints are only required to be satisfied in the long run.

To my parents, Asieh and Rashid.

ACKNOWLEDGMENTS

First and foremost, I feel indebted to my advisor, Professor Rong Jin, for his guidance, encouragement, and inspiring supervision throughout the course of this research work. His patience, extensive knowledge, and creative thinking have been the source of inspiration for me. He was available for advice or academic help whenever I needed it and gently guided me toward deeper understanding, no matter how late or inconvenient the time was. When I was struggling with the decision to quit my Ph.D. at Sharif University to join Rong's group, I was not sure about my choice, but after four and a half years, I am happy to say that I did not make a wrong decision. It's hard to express how thankful I am for his unwavering support over the last years.

I would like to take this opportunity to thank my thesis committee members Pang-Ning Tan, Ambuj Tewari, and Eric Torng, who accommodated my timing constraints despite their full schedules and provided me with precious feedback on the presentation of the results, in both written and oral form.

During my Ph.D. studies, I had the pleasure of collaborating with many researchers, from each and every one of whom I had things to learn, and the quality of my research was considerably enhanced by these interactions. I would like to thank Tianbao Yang and Lijun Zhang for all the discussions we had and the fun moments we spent doing research and attending conferences. The results of Chapter 5 and Chapter 8 represent part of the fruits of these collaborations. I also spent a summer as an intern at Microsoft Research working with Ofer Dekel and a summer at NEC Research Labs working with Shenghuo Zhu. I learned a lot from them and would like to express my gratitude for having me as an intern. I also would like to thank Elad Hazan, Satyen Kale, Phil Long, Shai Shalev-Shwartz, and Ohad Shamir for some helpful email correspondence.

Living in East Lansing without my good friends would not have been easy. I want to thank all my friends in the department and outside the department. I wish I could name you all.

Last but definitely not least, I want to express my deepest gratitude to my beloved parents and my dearest siblings. Their love and unwavering support have been crucial to my success, and a constant source of comfort and counsel. Special thanks to my parents for abiding by my absence in the last five years.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
LIST OF ALGORITHMS

Chapter 1 Introduction
  1.1 Classical Statistical Learning
    1.1.1 Smoothness and sample complexity
    1.1.2 Smoothness and binary excess risk
  1.2 Sequential Prediction/Game Theoretic Learning
    1.2.1 Smoothness and regret bounds
  1.3 Convex Optimization and Learning
    1.3.1 Smoothness and convergence rate
    1.3.2 Smoothness and projection-free optimization
  1.4 Main Contributions
    1.4.1 Statistical Learning
    1.4.2 Sequential Prediction/Online Learning
    1.4.3 Stochastic Optimization
  1.5 Thesis Overview
  1.6 Bibliographic Notes

Chapter 2 Preliminaries
  2.1 Statistical Learning
    2.1.1 Statistical Learning Model
    2.1.2 Empirical Risk Minimization
    2.1.3 Surrogate Loss Functions and Statistical Consistency
    2.1.4 Convex Learning Problems
  2.2 Sequential Prediction/Online Learning
    2.2.1 Mistake Bound Model and Regret Analysis
    2.2.2 Online Convex Optimization and Regret Bounds
      2.2.2.1 Online Gradient Descent
      2.2.2.2 Follow The Perturbed Leader
      2.2.2.3 Follow The Regularized Leader
      2.2.2.4 Online Mirror Descent
      2.2.2.5 Online Newton Step
    2.2.3 Variational Regret Bounds
    2.2.4 Bandit Online Convex Optimization
    2.2.5 From Regret to Risk Bounds
  2.3 Convex Optimization
    2.3.1 Oracle Complexity of Optimization
    2.3.2 Deterministic Convex Optimization
      2.3.2.1 Gradient Descent Method
      2.3.2.2 Accelerated Gradient Descent Method
      2.3.2.3 Mirror Descent Method
      2.3.2.4 Mirror Prox Method
      2.3.2.5 Conditional Gradient Descent Method
    2.3.3 Stochastic Convex Optimization
    2.3.4 Convex Optimization for Learning Problems
    2.3.5 From Stochastic Optimization to Convex Learning Theory

Chapter 3 Passive Learning with Target Risk
  3.1 Setup and Motivation
  3.2 The Curse of Stochastic Oracle
  3.3 The ClippedSGD Algorithm
    3.3.1 The Algorithm Description
    3.3.2 Main Result on Sample Complexity
  3.4 Analysis of Sample Complexity
  3.5 Proofs of Sample Complexity
    3.5.1 Proof of Lemma 3.6
    3.5.2 Proof of Lemma 3.7
  3.6 Summary
  3.7 Bibliographic Notes

Chapter 4 Statistical Consistency of Smoothed Hinge Loss
  4.1 Motivation
  4.2 Classification Calibration and Surrogate Risk Bounds
  4.3 Binary Excess Risk for Smoothed Hinge Loss
    4.3.1 ψ-Transform for Smoothed Hinge Loss
    4.3.2 Bounding E(h) based on Eϕ(h)
  4.4 A Unified Analysis
    4.4.1 Bounding Smooth Excess Convex Risk Eϕ(h)
    4.4.2 Bounding Binary Excess Risk E(h)
  4.5 Proofs of Statistical Consistency
    4.5.1 Proof of Theorem 4.5
    4.5.2 Proof of Theorem 4.6
    4.5.3 Proof of Theorem 4.11
  4.6 Summary

Chapter 5 Regret Bounded by Gradual Variation
  5.1 Variational Regret Bounds
  5.2 Gradual Variation and Necessity of Smoothness
  5.3 The Improved FTRL Algorithm
  5.4 The Online Mirror Prox Algorithm
    5.4.1 Online Mirror Prox Method with General Norms
    5.4.2 Online Linear Optimization
    5.4.3 Prediction with Expert Advice
    5.4.4 Online Strictly Convex Optimization
    5.4.5 Gradual Variation Bounds which Hold Uniformly over Time
  5.5 Bandit Online Mirror Prox with Gradual Variation Bounds
  5.6 Proofs of Gradual Variation
    5.6.1 Proof of Theorem 5.3
    5.6.2 Proof of Theorem 5.5
    5.6.3 Proof of Corollary 5.12
    5.6.4 Proof of Theorem 5.13
  5.7 Summary
  5.8 Bibliographic Notes

Chapter 6 Gradual Variation for Composite Losses
  6.1 Composite Losses with a Fixed Non-smooth Component
    6.1.1 A Simplified Online Mirror Prox Algorithm
    6.1.2 A Gradual Variation Bound for Online Non-Smooth Optimization
  6.2 Composite Losses with an Explicit Max Structure
  6.3 Summary

Chapter 7 Mixed Optimization for Smooth Losses
  7.1 Motivation
  7.2 The MixedGrad Algorithm
  7.3 Analysis of Convergence Rate
  7.4 Proofs of Convergence Rate
    7.4.1 Proof of Lemma 7.7
  7.5 Summary
  7.6 Bibliographic Notes

Chapter 8 Mixed Optimization for Smooth and Strongly Convex Losses
  8.1 Introduction
  8.2 The Epoch Mixed Gradient Descent Algorithm
  8.3 Analysis of Convergence Rate
  8.4 Experiments
  8.5 Discussion
  8.6 Summary
  8.7 Bibliographic Notes

Chapter 9 Efficient Optimization with Bounded Projections
  9.1 Setup and Motivation
  9.2 Stochastic Frank-Wolfe Algorithm
  9.3 Stochastic Optimization with Single Projection
    9.3.1 General Convex Functions
    9.3.2 Strongly Convex Functions with Smoothing
  9.4 Analysis of Convergence Rate
    9.4.1 Convergence Rate for General Convex Functions
    9.4.2 Convergence Rate for Strongly Convex Functions
  9.5 Online Optimization with Soft Constraints
    9.5.1 An Impossibility Theorem
    9.5.2 An Online Algorithm with Vanishing Violation of Constraints
    9.5.3 An Online Algorithm without Violation of Constraints
  9.6 Proofs of Convergence Rates
    9.6.1 Proof of Lemma 9.10
    9.6.2 Proof of Lemma 9.11
    9.6.3 Proof of Lemma 9.13
    9.6.4 Proof of Lemma 9.14
  9.7 Summary
  9.8 Bibliographic Notes

APPENDIX
BIBLIOGRAPHY

LIST OF TABLES

Table 2.1  Lower bound on the oracle complexity for stochastic/deterministic first-order optimization methods. Here ρ, α, and β are the Lipschitzness, strong convexity, and smoothness parameters, respectively. The parameter κ is the condition number of the function and is defined as κ = β/α.
Table 8.1  The optimal iteration complexity of convex optimization. L and λ are the moduli of smoothness and strong convexity, respectively. κ = L/λ is the condition number.
Table 8.2  Experimental results on the Adult data set.
Table 8.3  Experimental results on the RCV1 data set.
Table 8.4  The testing accuracy on the RCV1 and Adult data sets.
Table 8.5  The computational complexity for minimizing (1/n) ∑_{i=1}^n f_i(w).

LIST OF FIGURES

Figure 2.1  Illustrations of the 0-1 loss function and three surrogate convex loss functions: hinge loss, logistic loss, and exponential loss, as scalar functions of y⟨w, x⟩.
Figure 2.2  Reduction of the general online convex optimization problem to online optimization with linear functions.
Figure 2.3  Reduction of bandit online convex optimization to online convex optimization with full information. The full OCO algorithm needs to play from a shrunk domain (1 − ξ)W to ensure that the sampled points belong to the domain.
Figure 5.1  Illustration of the main idea behind the proposed improved FTRL and online mirror prox methods to attain regret bounds in terms of gradual variation for linear loss functions. The learner plays the decision ŵ_t instead of w_t to suffer less regret when the consecutive loss functions are gradually evolving.

LIST OF ALGORITHMS

Algorithm 1   ClippedSGD Algorithm
Algorithm 2   Linearized Follow The Regularized Leader for OCO
Algorithm 3   Improved FTRL (IFTRL) Algorithm
Algorithm 4   Online Mirror Prox (OMP) Algorithm
Algorithm 5   Online Mirror Prox Method for General Norms
Algorithm 6   Deterministic Online Bandit Convex Optimization
Algorithm 7   A Simplified General Online Mirror Prox Method
Algorithm 8   Online Mirror Prox Method with a Fixed Non-Smooth Component
Algorithm 9   Online Mirror Prox Method with an Explicit Max Structure
Algorithm 10  MixedGrad Algorithm
Algorithm 11  Epoch Mixed Gradient Descent (EMGD) Algorithm
Algorithm 12  SGD with ONE Projection by Primal Dual Updating (SGD-PD)
Algorithm 13  SGD with ONE Projection by a Smoothing Technique (SGD-ST)
Algorithm 14  Online Gradient Descent with Soft Constraints
Chapter 1

Introduction

In machine learning the goal is to learn from labeled examples in order to predict the labels of unseen examples. That is, given a training set, we aim to learn a hypothesis, or classifier, that assigns labels to samples that have never been observed by the algorithm. Efficiently finding a hypothesis based on the training set which minimizes some measure of performance is the main focus of machine learning. In order to study the learning problem in a mathematical framework, it is necessary to define the framework in which the algorithm is to function. Basically, there are two frameworks that have gained significant popularity within the last two decades: the statistical learning framework and the sequential prediction or online learning framework. In both settings, mathematical optimization theory plays an important role by providing a unified framework to investigate the computational issues of learning algorithms. Additionally, tools from convex optimization underlie the analysis in algorithmic machine learning that targets most practical learning algorithms in both frameworks.

This chapter is devoted to an overview of these three broad topics of statistical learning, sequential prediction/online learning, and convex optimization, aiming to develop a general correspondence between the first two and convex optimization. In particular, by characterizing sample complexity, statistical consistency, regret analysis, and convergence rate in terms of the properties of loss functions such as Lipschitzness, strong convexity, and smoothness, we elaborate on the importance of smoothness and explicate its consequences. (The precise definition will be given later; we say that a continuously differentiable function f : R^d → R is β-smooth if its gradient is Lipschitz with constant β, i.e., ∥∇f(w) − ∇f(w′)∥ ≤ β∥w − w′∥.) Here we move towards the definitions in a fairly non-technical manner, and the formal definitions will be given in Chapter 2.

1.1 Classical Statistical Learning

We begin by stating the basic problem of binary classification in the standard passive supervised learning setting (also called batch learning). In binary classification, the learning algorithm is given a set of labeled examples S = ((x1, y1), · · · , (xn, yn)) drawn independent and identically distributed (i.i.d.) from a fixed but unknown distribution D over the space Ξ = X × Y, where X is the instance space and Y is the label (target) space. The goal, with the help of the provided labeled examples, is to output a hypothesis or classifier h from a predefined hypothesis class H = {h : X → Y} that does well on unseen examples coming from the same distribution. In other words, we would like to find a hypothesis that generalizes well from the training set to the entire domain of examples.

To measure the performance of a classifier h ∈ H on unseen samples, we utilize a loss function ℓ : H × Ξ → R+. The most commonly used loss function in the binary classification problem, with instance space defined as Ξ = R^d × {−1, +1}, is the 0-1 loss ℓ(h, (x, y)) = I[h(x) ≠ y], where I[·] is the indicator function. The risk of a particular classifier h ∈ H is the probability that the classifier does not predict the correct label on a random data point generated by the underlying distribution D, i.e., L_D(h) = P_{(x,y)∼D}[h(x) ≠ y] = E_{(x,y)∼D}[ℓ(h, (x, y))]. This performance measure is also called the generalization error or true risk in the statistical learning community.
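Since the true risk L_D(h) is an expectation over the unknown distribution D, in practice it is approximated by an empirical average over a finite sample. The following small Python sketch is an illustration only (the synthetic data, the fixed linear classifier, and all function names are assumptions of this example, not part of the thesis): it shows the empirical 0-1 risk of a fixed hypothesis approaching its true risk as the sample size grows.

    import numpy as np

    rng = np.random.default_rng(0)

    def sample(n, d=5):
        # Draw n i.i.d. examples (x, y) from a synthetic distribution D:
        # x is standard Gaussian and y follows a noisy linear labeling rule.
        x = rng.normal(size=(n, d))
        y = np.sign(x @ np.ones(d) + 0.5 * rng.normal(size=n))
        return x, y

    def empirical_zero_one_risk(w, x, y):
        # Average of the 0-1 loss I[h(x) != y] for h(x) = sign(<w, x>).
        return float(np.mean(np.sign(x @ w) != y))

    w = np.ones(5)  # a fixed hypothesis h(x) = sign(<w, x>)
    for n in (100, 10_000, 1_000_000):
        x, y = sample(n)
        print(n, empirical_zero_one_risk(w, x, y))  # approaches L_D(h) as n grows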
An equivalent way to express the generalization bound is the sample complexity analysis. Roughly speaking, the sample complexity of an algorithm is the number of examples which is sufficient to ensure that, with probability at least 1 − δ (w.r.t. the random choice of S), the algorithm picks a hypothesis with an error that is at most ϵ away from that of the optimal one. The difference between the risk of a particular classifier h and that of the optimal classifier h* = arg min_{h∈H} L_D(h) is called the excess risk of h, i.e., E(h) = L_D(h) − L_D(h*).

Since the underlying distribution D is unknown to the learner, it is impossible for the learner to directly minimize the generalization error or true risk. Therefore, one has to resort to using the training data in S to estimate the probabilities of error for the classifiers in H. This alternative approach is known as the Empirical Risk Minimization (ERM) method and aims to pick a hypothesis which has small error on the training set, i.e., small empirical risk. The performance of empirical risk minimization has been thoroughly investigated and is well understood using tools from empirical process theory. It is a well established fact that a problem is learnable with the ERM method if and only if the empirical error for all hypotheses in H converges uniformly to the true risk. Furthermore, the uniform convergence holds if the complexity of the hypothesis class H satisfies certain combinatorial characteristics. It is one of the main achievements of statistical learning theory to characterize, and establish necessary and sufficient conditions for, the learnability of learning problems using the ERM rule.

While the ERM method is theoretically appealing, from a practical point of view one would like to consider problems that are efficiently learnable, which refers to the computational complexity of the learning algorithm. This issue becomes more important by noting the fact that in many cases the ERM approach suffers from substantial problems, such as the computational requirements of minimizing the 0-1 loss over the training set. Indeed, solving the ERM problem for the 0-1 loss function is known to be an NP-hard problem. Consequently, it is natural to consider loss functions that act as surrogates for the non-convex 0-1 loss and lead to practical algorithms. Of course, such a surrogate loss must be reasonably related to the original binary loss function, since otherwise this approach fails. For the classification problem, good surrogate loss functions have been identified, and the relationship between the excess classification risk and the excess risk of these surrogate loss functions has been exactly described.

An important family of learning problems that can be learnt efficiently are called Convex Learning Problems. In general, a convex learning problem is a setting in which the surrogate loss function and the hypothesis space H are both convex. This setting encompasses an enormous variety of well-known practical learning algorithms such as regression, support vector machines (SVMs), boosting, and logistic regression, where these algorithms differ in the type of the convex loss function being used as the surrogate of the 0-1 loss. Interestingly, for convex learning problems the ERM rule, of minimizing the empirical convex loss over a convex domain H, becomes a convex optimization problem, making an intimate connection between machine learning and mathematical optimization.
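For concreteness, the ERM rule described above and its convex relaxation can be written as follows (a standard formulation; the hinge loss is used here only as one possible surrogate, and the bounded-norm linear hypothesis class is an illustrative choice):

    ĥ_S = arg min_{h∈H} (1/n) ∑_{i=1}^n I[h(x_i) ≠ y_i]
        (ERM with the 0-1 loss; NP-hard in general)

    ŵ_S = arg min_{∥w∥≤R} (1/n) ∑_{i=1}^n max(0, 1 − y_i⟨w, x_i⟩)
        (ERM with a convex surrogate, here the hinge loss, over a convex domain; a convex optimization problem)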
Learnability in this setting departs from learnability via the ERM method and strongly depends on the characteristics of the convex domain, such as boundedness, and the analytical properties (curvature) of the loss function, such as Lipschitzness, smoothness (i.e., differentiability with Lipschitz gradients), and strong convexity (i.e., at any point one can find a convex quadratic lower bound for the function). Beyond learnability, the sample complexity of learning algorithms can also be characterized in terms of the analytical properties of the loss function. Therefore, smoothness and strong convexity of the convex surrogate loss function play a crucial role in characterizing learnability and in the analysis of the sample complexity of convex learning problems.

1.1.1 Smoothness and sample complexity

While the main focus of statistical learning theory has been on understanding learnability and sample complexity by investigating the complexity of the hypothesis class in terms of known combinatorial measures under the uniform convergence property, recent advances in online learning and optimization theory opened a new trend in understanding the generalization ability of learning algorithms in terms of the characteristics of the loss functions being used in convex learning problems. In particular, a staggering number of results have focused on strong convexity of the loss function and obtained better generalization bounds, which are referred to as fast rates. In terms of smoothness of the loss function, it has recently been shown that under this assumption it is possible to obtain optimistic rates (in the sense that smooth losses yield better generalization bounds when the problem is easier), which are more appealing than in the case where the convex surrogate loss is only Lipschitz continuous. This motivates us to take a step forward in this direction and investigate the smoothness of loss functions in more depth.

1.1.2 Smoothness and binary excess risk

As noted above, the convex surrogates of the 0-1 loss are highly preferred because of the computational and theoretical virtues that convexity brings in. Since the choice of convex surrogate could significantly affect the binary excess risk, the relation between risk bounds in terms of the binary 0-1 loss and its corresponding convex surrogate has been a focus of the learning community over the last decade. It has been shown that the binary excess risk can be upper bounded by the convex excess risk through a transform function that only depends on the surrogate convex loss.

Although a great deal of work has been devoted to understanding the relation between the binary excess risk and the convex excess risk, there remain a variety of open problems. In particular, this transformation is well understood under mild conditions such as convexity, but it is unclear how other properties of convex surrogates, such as smoothness, may affect this relation. This becomes more critical if we consider smooth surrogates, as witnessed by the fact that smoothness is further beneficial both computationally, by attaining an optimal convergence rate for the optimization error, and statistically, by providing an improved optimistic rate for the generalization bound. Given this positive news about using smooth convex surrogates, an open research question is how the smoothness of a convex surrogate will affect the binary excess risk. We are thus motivated to investigate the impact of the smoothness of a convex loss function on transforming the excess risk in terms of the convex surrogate loss into the binary excess risk.
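Collecting the curvature conditions referred to in this section (standard definitions; the formal treatment is given in Chapter 2), for a differentiable function f : R^d → R and all w, w′ in its domain:

    ρ-Lipschitz:          |f(w) − f(w′)| ≤ ρ∥w − w′∥
    β-smooth:             ∥∇f(w) − ∇f(w′)∥ ≤ β∥w − w′∥, equivalently f(w′) ≤ f(w) + ⟨∇f(w), w′ − w⟩ + (β/2)∥w′ − w∥²
    α-strongly convex:    f(w′) ≥ f(w) + ⟨∇f(w), w′ − w⟩ + (α/2)∥w′ − w∥²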
1.2 Sequential Prediction/Game Theoretic Learning

An alternative paradigm for analyzing learning problems is sequential or online learning, which can be phrased as a repeated two-player game between the learner and an adversary, making an intimate connection between learning and adversarial game theory. In the sequential prediction/online learning framework, the learner is faced with a sequence of samples appearing at discrete time intervals and is required to make predictions sequentially. In contrast to the statistical setting, in which the data source is typically assumed to be i.i.d. with an unknown distribution, in the online framework we relax or eliminate any stochastic assumptions imposed on the samples, and they might be chosen adversarially. As a result, the online learning framework is better suited for adversarial and interactive learning tasks such as spam email detection and stock market prediction, where decisions of the learner could negatively affect future instances the learner receives.

By dropping the statistical assumptions on the observed sequence, it is not immediately clear how the prediction problem can be made meaningful and which goals are reasonable. One popular possibility is to measure the performance of the learner by the loss he/she has accumulated during the learning process and compare it to the loss of the best fixed solution. The cumulative loss suffered on a sequence of rounds is the sum of the instantaneous losses suffered on each one of the rounds in the sequence. In particular, the goal becomes to minimize the gap between the cumulative loss of the online learner and the loss of a strategy that selects the best action fixed in hindsight. This performance gap is called the regret. The analysis of regret mainly focuses on investigating how the regret depends on the length of the time horizon over which the game proceeds. We note that the best fixed action is chosen from a comparator class of predictors against which the learner will be compared and can only be computed in full knowledge of the sequence of loss functions. Regret analysis stands in stark contrast to the statistical framework, in which the learner is evaluated based on his/her accuracy after seeing all training examples, making the online learning setting inherently harder. The theoretical utility of online learning has long been appreciated. More recently, it has become a mainstay of optimization, where it serves as a computational platform from which a variety of large-scale learning problems can be solved.

The analogue of statistical learnability in the online setting is referred to as Hannan consistency. A hypothesis class is learnable in the online setting, i.e., Hannan consistent, if for any sequence of samples there exists an algorithm which attains sub-linear regret in terms of the number of rounds the interaction proceeds. Interestingly, unlike statistical learning theory, the analysis of online learning is mostly algorithmic, where efficient algorithms are proposed to solve the learning problem and their performance is analyzed to guarantee Hannan consistency.

Recently, tools from convex optimization made it possible to capture many online learning problems under a generic problem template, and in many circumstances to obtain improved regret bounds. This unified framework, which is referred to as Online Convex Optimization, assumes that the learner is forced to make decisions from a convex set and the adversary is supposed to play convex functions.
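In the online convex optimization template just described, with decisions w_t chosen from a convex set W and convex loss functions f_t played by the adversary, the regret after T rounds is the quantity

    Regret_T = ∑_{t=1}^T f_t(w_t) − min_{w∈W} ∑_{t=1}^T f_t(w),

and Hannan consistency requires Regret_T = o(T), i.e., the average per-round regret vanishes as T grows.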
Additionally, it has been demonstrated that the curvature of the convex loss functions played by the adversary, such as strong convexity, gives a great advantage to the player in attaining better regret bounds. Surprisingly, tools from online optimization have also provided insights for obtaining better convergence rates or more efficient algorithms for some stochastic and deterministic convex optimization problems.

1.2.1 Smoothness and regret bounds

Unlike strong convexity, the smoothness of loss functions is not a desirable property in the online setting, as it yields the same regret bounds as for loss functions that are merely Lipschitz continuous. However, there are scenarios in which the smoothness of the sequence of loss functions played by the adversary becomes important. One such scenario is online learning from loss functions that might exhibit some pattern and are not fully adversarial. For example, the weather condition or the stock price at one moment may have some correlation with the next, and their difference is usually small, while abrupt changes only occur sporadically. Therefore, devising online convex optimization algorithms which can take into account the gradual behavior of the environment and at the same time protect against worst-case sequences would be more desirable. In terms of regret analysis, this translates to having algorithms with regret bounded in terms of the variation of the loss functions instead of the time horizon, which is the main measure in the standard setting of sequential prediction. In these evolving settings, the smoothness of the loss function becomes critical. More importantly, no gradual variation bound is achievable if the loss functions are no longer smooth. This necessitates the development of online methods that exploit the smoothness assumption in the learning process or in the analysis to obtain improved regret bounds in terms of the variation of the sequence of loss functions, and it underlines our motivation in this thesis.

1.3 Convex Optimization and Learning

In the problem of convex optimization, we are interested in minimizing a given convex function f : R^d → R, from a predefined family of convex functions F, over a convex set W ⊆ R^d. The goal is to find an approximate solution with an accuracy ϵ, i.e., finding a w ∈ W where f(w) − min_{w∈W} f(w) ≤ ϵ. A typical optimization algorithm initially chooses a point from the feasible convex set, w_0 ∈ W, and iteratively updates these points based on some information about the function at hand until it achieves the desired accuracy. To capture the efficiency of an optimization procedure, we follow the black-box model of optimization. (As indicated by Yurii Nesterov in his seminal book [121], in general, optimization problems are unsolvable and we need to relax the goal to make it reachable.) In this model we assume that there exists an oracle which provides information about the query points, such as the function value, the gradient, and the second derivative (i.e., the Hessian). The number of queries issued to the oracle to find a solution with a predefined level of accuracy is called the oracle complexity when it is stated in terms of the desired accuracy ϵ, or equivalently the convergence rate when it is stated in terms of the number of queries.

As already mentioned, learning problems under both the statistical and online learning frameworks can be directly formulated as optimization problems.
In the statistical setting, and especially for convex learning problems, the learning algorithm corresponds to the optimization algorithm that solves the minimization problem of picking, from the set of hypotheses, a hypothesis that minimizes the empirical loss over the training sample. Similarly, in online convex optimization, the online learner iteratively chooses decisions from a closed, bounded, and non-empty convex set and encounters convex cost functions. Formulating and investigating both statistical and online learning problems in the context of convex optimization makes an intimate connection between learning and mathematical optimization. Therefore, the study of fast iterative methods for approximately solving convex programming problems is a central focus of research in convex optimization, with important applications in machine learning and many other areas of computer science.

The usefulness of convex optimization in the development of various learning algorithms has been well established in the past several years. Additionally, challenges in machine learning applications demand the development of new optimization algorithms. In optimization for supervised machine learning, and in particular the empirical risk minimization paradigm with convex surrogates and gradient information, there exist two regimes in which popular algorithms tend to operate: the deterministic regime (also known as batch optimization or the full gradient method), in which the whole training data set is used to compute the gradient at each iteration, and the stochastic regime, which samples a fixed number of training examples per iteration, typically a single training example, to compute the gradient at each iteration. Although stochastic optimization methods suffer from a lower convergence rate in comparison to batch methods, the lightweight computation per iteration makes them attractive for many large-scale learning problems. Hence, with the increasing amount of data that is available for training, stochastic convex optimization has emerged as the most scalable approach for large-scale machine learning, and it is known to yield moderately accurate solutions in a relatively short time. We emphasize that the role of convex optimization goes beyond computational issues; it also provides tools to characterize learnability in convex learning problems via efficient stochastic optimization algorithms for learning these problems.

Analogous to both the statistical and online learning frameworks, the curvature of the function to be optimized significantly affects the convergence rate of optimization methods. Perhaps the most extensively studied properties are strong convexity and smoothness of the function.

1.3.1 Smoothness and convergence rate

Exploiting the smoothness of the loss function, in particular in stochastic optimization, to obtain better convergence rates has been one of the main research challenges in recent years. Despite enormous advances in exploiting smoothness in deterministic optimization, it has not been utilized in stochastic optimization. In particular, stochastic optimization of smooth loss functions exhibits the same convergence rate as stochastic optimization under the Lipschitzness assumption alone. Therefore, this thesis is motivated by the need to develop stochastic optimization algorithms with better convergence rates under the smoothness assumption. The key question is whether or not the smoothness property of loss functions can be leveraged to develop much faster stochastic optimization methods.
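The two regimes described above differ only in how the gradient of the empirical risk is formed at each iteration. The Python sketch below is an illustration only (the least-squares loss, the synthetic data, the step sizes, and all function names are assumptions of this example): it contrasts a full gradient step, which touches all n training examples, with a stochastic gradient step, which touches a single randomly drawn example.

    import numpy as np

    rng = np.random.default_rng(0)

    # Empirical risk (1/n) * sum_i f_i(w) with f_i(w) = 0.5 * (<w, x_i> - y_i)^2,
    # a smooth convex loss used purely for illustration.
    X = rng.normal(size=(1000, 10))
    y = X @ np.ones(10) + 0.1 * rng.normal(size=1000)

    def full_gradient(w):
        # Deterministic (batch) regime: the gradient uses all n training examples.
        return X.T @ (X @ w - y) / len(y)

    def stochastic_gradient(w):
        # Stochastic regime: an unbiased gradient estimate from one random example.
        i = rng.integers(len(y))
        return (X[i] @ w - y[i]) * X[i]

    w_batch, w_sgd = np.zeros(10), np.zeros(10)
    for t in range(1, 201):
        w_batch -= 0.05 * full_gradient(w_batch)                    # one pass over the data per step
        w_sgd   -= 0.05 / np.sqrt(t) * stochastic_gradient(w_sgd)   # one example per step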
1.3.2 Smoothness and projection-free optimization

At the core of many iterative constrained optimization algorithms, in both online and stochastic convex optimization, is a projection step to ensure the feasibility of the solutions at intermediate iterations. This is a serious deficiency, since in many applications the projection onto the constrained domain might be computationally expensive and sometimes as hard as solving the original optimization problem. It is therefore of considerable interest to devise optimization methods which do not require projection steps or need only a bounded number of projection operations. As will become clear later in this thesis, by smoothing a strongly convex objective function, we are able to reduce the number of projections to a single projection at the end of the optimization process. In contrast to the other parts of the thesis, where we assume and exploit smoothness, this is the only result that injects and leverages smoothness in order to gain from its merits and devise more efficient algorithms.

1.4 Main Contributions

In this section we elaborate on the main problems considered in this thesis and our key contributions to addressing these problems. A common theme in all of the algorithms is that they exploit the smoothness of the loss function to obtain more efficient methods.

1.4.1 Statistical Learning

• Logarithmic sample complexity for learning from smooth and strongly convex losses with target risk. The first problem we consider in this thesis has a statistical nature. In particular, we consider learning in the passive setting but with a slight modification. We assume that the target expected loss, also referred to as the target risk, is provided in advance to the learner as prior knowledge. Unlike most studies in learning theory that only incorporate prior knowledge into the generalization bounds, we are able to explicitly utilize the target risk in the learning process. Our analysis reveals a surprising result on the sample complexity of learning: by exploiting the target risk in the learning algorithm, we show that when the loss function is both smooth and strongly convex, the sample complexity reduces to O(log(1/ϵ)), an exponential improvement compared to the sample complexity O(1/ϵ) for learning with strongly convex loss functions. Unlike previous works on sample complexity, the proof of our result is constructive and is based on a computationally efficient stochastic optimization algorithm, which makes it practically interesting. The proposed ClippedSGD algorithm uses knowledge of the target risk to appropriately clip gradients obtained from a stochastic oracle. The clipping is beneficial because it reduces the variance in the stochastic gradients and makes it possible to reduce the sample complexity. This happens under the assumption that the loss function is smooth and strongly convex.

• Statistical consistency of smoothed hinge loss. The second problem we address in the statistical learning setting is to investigate the relation between the excess risk that can be achieved by minimizing the empirical binary risk and the excess risk of smooth convex surrogates. As mentioned earlier, convex surrogates of the 0-1 loss are highly preferred because of the computational and theoretical virtues that convexity brings in.
This is of more importance if we consider smooth surrogates, as witnessed by the fact that smoothness is further beneficial both computationally, by attaining an optimal convergence rate for optimization, and statistically, by providing an improved optimistic rate for the generalization bound. However, we investigate the smoothness property from the viewpoint of statistical consistency and show how it affects the binary excess risk for the smoothed hinge loss. In particular, we intend to answer the following fundamental questions: "How does the smoothness of the surrogate convex loss affect the binary excess risk? Considering the advantages of smooth losses in terms of optimization and generalization, is it beneficial or detrimental in terms of statistical consistency? Under what conditions on these three types of errors is it better to use smooth losses?"

We show that, in contrast to the optimization and generalization errors that favor the choice of a smooth surrogate loss, the smoothness of the loss function may deteriorate the binary excess risk. Motivated by this negative result, we provide a unified analysis that integrates the optimization error, the generalization bound, and the error in translating the convex excess risk into a binary excess risk when examining the impact of smoothness on the binary excess risk. We show that under favorable conditions an appropriate choice of smooth convex surrogate loss will result in a binary excess risk that is better than O(1/√n), which is unimprovable for general non-smooth Lipschitz losses.

1.4.2 Sequential Prediction/Online Learning

• Regret bounded by gradual variation for smooth online convex optimization. As our third problem, we study the online convex optimization problem under the assumption that even though the loss functions are arbitrary, there is a hidden pattern that can be exploited in the learning process. Therefore, an interesting question that inspires our work in the analysis of online learning algorithms is the following: "Can we have online algorithms that can take advantage of benign sequences and at the same time protect against adversarial sequences?"

To answer this question, we introduce the gradual variation, measured by the sum of the distances between every two consecutive loss functions, to assess the performance of online learning algorithms in gradually evolving environments such as stock prediction. We propose two novel algorithms, an Improved Follow the Regularized Leader (IFTRL) algorithm and an Online Mirror Prox (OMP) method, that achieve a regret bound which only scales as the square root of the gradual variation for linear and general smooth convex loss functions. To establish the main results, we discuss a lower bound for online gradient descent and a necessary condition on the smoothness of the cost functions for obtaining a gradual variation bound. For the closely related problem of prediction with expert advice, we show that an online algorithm modified from the multiplicative update algorithm can also achieve a similar regret bound for a different measure of deviation. Finally, for loss functions which are strongly convex, in applications such as the portfolio management problem, we show a regret which is only logarithmic in terms of the gradual variation. The gradual variation, in addition to its intrinsic interest as an extension of regret analysis, has several specific consequences.
First, since the gradual variation lower bounds the standard regret bound, devising an algorithm whose regret is bounded by the gradual variation also guarantees a small regret. Second, algorithms with regret bounded by the gradual variation are specifically designed to exploit small variation; therefore they can capture the correlation between loss functions, if it exists, and boost the performance.

• Gradual variation for composite online convex optimization. As an impossibility result for obtaining gradual variation bounds for general convex losses, we show that for non-smooth functions, when the only information presented to the learner is first-order information about the cost functions, it is impossible to obtain a regret bounded by the gradual variation. However, we show that a gradual variation bound is achievable for a special class of non-smooth functions that are composed of a smooth component and a non-smooth component. We consider two categories for the non-smooth component. In the first category, we assume that the non-smooth component is a fixed function and is relatively easy, such that the composite gradient mapping can be solved without much computational overhead compared to the plain gradient mapping. In the second category, we assume that the non-smooth component can be written as an explicit maximization structure. In general, we consider a time-varying non-smooth component, present a primal-dual prox method, and prove a min-max regret bound by gradual variation. When the non-smooth components are equal across all trials, the usual regret is bounded by the min-max bound plus a variation in the non-smooth component.

1.4.3 Stochastic Optimization

• Improved convergence rate for stochastic optimization of smooth losses. We then turn to exploiting smoothness in stochastic optimization. Recently, stochastic optimization methods have experienced a renaissance in the design of fast algorithms for large-scale learning problems. Unlike optimization methods based on full gradients, the smoothness assumption has not been exploited by most stochastic optimization methods. More importantly, for general Lipschitz continuous convex functions, simple stochastic optimization methods such as stochastic gradient descent exhibit the same convergence rate as for smooth functions, implying that smoothness of the loss function is essentially not very useful and cannot be exploited in stochastic optimization. Therefore, noting this significant gap between the convergence rates for optimizing smooth functions in stochastic and deterministic optimization, the natural question that arises is: "Can the smoothness property of a function be exploited to speed up the convergence rate of stochastic optimization of smooth functions?"

We provide an affirmative answer to this question. In particular, we propose a novel optimization paradigm which interpolates between stochastic and full gradient methods and is able to exploit the smoothness of loss functions in the optimization process to obtain faster rates. The results show an intricate interplay between stochastic and deterministic convex optimization. The MixedGrad algorithm we propose fits in the mixed optimization paradigm and is an alternation of deterministic and stochastic gradient steps, with different frequencies for each type of step. We show that it attains an O(1/T) convergence rate for smooth losses.

• Condition number independent accesses of the full gradient oracle for smooth and strongly convex optimization.
The optimal iteration complexity of gradient based algorithms for smooth and strongly convex objectives is O(√κ log(1/ϵ)), where κ is the condition number (the ratio of the smoothness parameter to the strong convexity parameter). Despite its linear convergence rate in terms of the target accuracy ϵ, in the case that the optimization problem is ill-conditioned we need to evaluate a large number of full gradients, which could be computationally expensive. Therefore, a natural question is: "Can we manage the dependency on the condition number and devise optimization methods independent of the condition number in accessing the full gradient oracle?"

We show that in the mixed optimization regime introduced in this thesis, we may also leverage the smoothness assumption of the loss functions to devise algorithms with iteration complexities that are independent of the condition number in accessing the full gradient oracle. We utilize the idea of mixed optimization in progressively reducing the variance of stochastic gradients to optimize smooth and strongly convex functions, and propose the Epoch Mixed Gradient Descent (EMGD) algorithm, which is independent of the condition number in accessing the full gradients. Similar to the MixedGrad algorithm, a distinctive step in EMGD is the mixed gradient descent, where we use a combination of the full gradient and the stochastic gradient to update the intermediate solutions. By performing a fixed number of mixed gradient descent steps, we are able to improve the suboptimality of the solution by a constant factor, and thus achieve a linear convergence rate. Theoretical analysis shows that EMGD is able to find an ϵ-optimal solution by computing O(log(1/ϵ)) full gradients and O(κ² log(1/ϵ)) stochastic gradients. We also provide experimental evidence complementing our theoretical results for classification problems on a few medium-sized data sets.

• Efficient projection-free online and stochastic convex optimization. Another problem we address in this thesis is efficient projection-free optimization methods for stochastic and online convex optimization. Our motivation stems from the observation that most gradient-based optimization algorithms require a projection onto the convex set W from which the decisions are made. While the projection is straightforward for simple shapes (e.g., the Euclidean ball), for arbitrary complex sets it is the main computational bottleneck and may be inefficient in practice. For instance, for many applications in machine learning, such as metric learning, the convex domain is the positive semidefinite cone, for which the projection step requires a full eigendecomposition. For many other problems, the projection step is itself an offline optimization problem and might be as hard as solving the original optimization problem. This observation immediately leads to the following question that inspires our work: "To what extent is it possible to reduce the number of expensive projection steps in online and stochastic optimization? Can we trade expensive projection steps off for other types of light computational operations?"

We consider this problem in two settings: stochastic optimization and online convex optimization. In the stochastic setting, we develop novel stochastic optimization algorithms that do not need intermediate projections. Instead, only one projection at the last iteration is needed to obtain a feasible solution in the given domain.
Our theoretical analysis shows that, with high probability, the proposed algorithms achieve an O(1/√T) convergence rate for general convex optimization and an O(ln T / T) rate for strongly convex optimization, under mild conditions on the domain and the objective function. The key insight which underlies the proposed projection-free algorithm for strongly convex functions is smoothing the objective function. This is in contrast to the other problems in this thesis, where we try to leverage the smoothness of the objective, while here we introduce smoothness to gain from its computational virtues in alleviating the projection steps.

In the online setting, we consider an alternative online convex optimization problem. Instead of requiring that decisions belong to a constrained convex domain for all rounds, we only require that the constraints, which define the convex set, be satisfied in the long run. By turning the problem into an online convex-concave optimization problem, we propose an efficient algorithm which achieves an O(√T) regret bound and an O(T^{3/4}) bound on the violation of the constraints. We then modify the algorithm in order to guarantee that the constraints are satisfied in the long run. This gain is achieved at the price of an O(T^{3/4}) regret bound. We also prove an impossibility result which shows that simple ideas, such as augmenting the objective function with penalized constraints, fail to solve the problem and result in a linear bound O(T) for either the regret or the violation of the constraints.

1.5 Thesis Overview

The remainder of this thesis is organized as follows. Chapter 2 lays out the foundation for the rest of the thesis. In particular, we provide a survey of some of the background material from statistical learning and sequential prediction/online learning theory, as well as convex optimization. It will become clear in this chapter that there exist deep connections between these three areas.

The first part of the thesis focuses on statistical learning, investigating the sample complexity of learning when the target risk is known to the learner and the consistency of the smoothed hinge loss. In Chapter 3 we focus on statistical learning with target risk under the assumption that the loss function is smooth and strongly convex. Chapter 4 investigates the consistency of the smoothed hinge loss and provides negative and positive results on transforming its excess risk into a binary excess risk.

The second part of the thesis is on sequential prediction/online learning and introduces the gradual variation measure to assess the performance of online convex optimization algorithms in gradually evolving environments. Chapter 5 discusses the necessity of smoothness for obtaining regret bounds in terms of gradual variation, followed by two efficient algorithms that obtain gradual variation bounds for smooth online convex optimization problems. The adaptation to other settings, such as the expert advice problem and strongly convex loss functions, is also discussed. The extension of the results to special composite loss functions with a smooth component is discussed in Chapter 6.

The third part of the thesis is devoted to devising efficient stochastic optimization algorithms by leveraging the smoothness of loss functions. We propose the mixed optimization paradigm for stochastic optimization in Chapter 7 and extend it to smooth and strongly convex losses in Chapter 8.
The stochastic optimization methods with bounded projections and online optimization with soft constraints are elaborated in Chapter 9. Finally, the appendix summarizes rather standard material on convex analysis and concentration inequalities that is used in the proofs of the results in the thesis and is mainly for reference. In order to facilitate independent reading of the various chapters, some of the definitions from convex analysis are repeated several times.

1.6 Bibliographic Notes

Some of the results in this dissertation have appeared in prior publications. The material in Chapter 3 is based on a work published in the Conference on Learning Theory (COLT) [102], and the content of Chapter 4 is new [108]. The material in Chapter 5 and Chapter 6 comes from [41], which was published at COLT and whose extended version has recently been published in the Machine Learning journal [150]. The results in Chapter 7 and Chapter 8 follow [107] and [152], respectively, which appeared in Advances in Neural Information Processing Systems (NIPS). The content of Chapter 9 is mostly compiled from [106], [104], and [105], which are published at NIPS and in the Journal of Machine Learning Research (JMLR).

Chapter 2 Preliminaries

The goal of this chapter is to give a gentle yet formal overview of the material related to the work done in this thesis. In particular, we will discuss key concepts and questions in statistical learning, online learning, and convex optimization, and will highlight the role of analytical properties of loss functions such as Lipschitzness, strong convexity, and smoothness in all of these settings. The exposition given here is necessarily brief, and detailed discussions will be provided in the relevant chapters.

2.1 Statistical Learning

2.1.1 Statistical Learning Model

In a typical supervised statistical learning problem (also known as passive learning or batch learning), we are given an instance space X and a space Y of labels or target values. Each element of the data domain X represents an object to be classified, e.g., the content of an email in a spam detection application or the features of an image in vision applications. The target space Y can be either discrete, Y = {−1, +1}, as in the case of classification, or continuous, Y = R, as in the case of regression. To model learning problems in a statistical or probabilistic setting, we assume that the product space Ξ ≡ X × Y is endowed with a probability measure D which is unknown to the learner. However, it is possible to sample an arbitrary number of pairs S = ((x_1, y_1), (x_2, y_2), · · · , (x_n, y_n)) ∈ Ξ^n according to the underlying distribution D. We term this set of examples the training set, or the training sample. The existence of the distribution D is necessary to ensure that the already collected samples S have something in common with the new and unseen data. A hypothesis or classifier h : X → Y is a function that assigns a label h(x) ∈ Y to any instance x ∈ X such that the assigned label is a good approximation of the possible response y to an arbitrary instance x generated according to the distribution D. In other words, the hypothesis h captures the functional relationship between the input instances and the output, which in turn makes it possible to predict the output value for future input instances. In light of the no free lunch theorem [148], learning is impossible unless we make assumptions regarding the nature of the problem at hand.
Therefore, when approaching a particular learning problem, it is desirable to take into account some prior knowledge we might have about the problem. In this regard, we assume that the learning algorithm is confined to a predetermined set of candidate hypotheses H = {h : X → Y} to which we wish to compare the result of our learning algorithm. In a specific learning context, the hypothesis class can represent our beliefs about the true nature of the classification rule for the problem. For example, the hypothesis class for binary classification might be a subset of a vector space with bounded norm representing the linear classifiers, i.e., H = {x → sign(⟨w, x⟩ + b) : w ∈ R^d, b ∈ R, ∥w∥ ≤ R}. In order to measure the performance of a learning algorithm, we usually use a loss function ℓ : H × Ξ → R+. The instantaneous loss incurred by a learning algorithm on instance z = (x, y) ∈ Ξ for picking hypothesis h ∈ H is given by ℓ(h, z). For example, in binary classification problems, Ξ = X × {−1, 1}, H is a set of functions h : X → {−1, 1}, and the loss function is the binary or 0-1 loss defined as ℓ_{0-1}(h(x), y) = I[h(x) ≠ y], where I[·] is the indicator function that takes value 1 if its argument is true and 0 otherwise. In regression problems, where the goal is to predict real-valued labels Y = R, the common loss function used to evaluate the performance of a regressor h on a sample z = (x, y) is the squared loss ℓ(h, z) = (h(x) − y)². A classifier is constructed on the basis of n independent and identically distributed (i.i.d.) samples S = ((x_1, y_1), (x_2, y_2), · · · , (x_n, y_n)) from Ξ. The ultimate goal of a typical learning algorithm is to pick a classifier h ∈ H that is competitive with the best hypothesis from H with respect to the expected risk or generalization error, defined as:

L_D(h) = E_{z∼D}[ℓ(h, z)],

where E[·] denotes the expectation with respect to the (unknown) probability distribution D underlying our samples, and ℓ(h, (x, y)) is the binary loss I[h(x) ≠ y] for classification and the squared loss (h(x) − y)² for regression. For binary classification, the generalization error of a hypothesis h : X → {−1, +1} is simply the probability that it predicts the wrong label on a randomly drawn instance from Ξ, i.e., L_D(h) = P_{(x,y)∼D}[h(x) ≠ y]. The difference between the risk of a particular classifier h and that of the optimal classifier h* = arg min_{h∈H} L_D(h) is called the excess risk of h, i.e., E(h) = L_D(h) − L_D(h*). In designing any typical solution to a supervised machine learning problem, there are a few key questions that must be considered. The first of these concerns approximation, which characterizes how rich the solution space H is in approximating the true underlying model. The second fundamental issue concerns estimation, which characterizes how well the obtained solution performs in making future predictions on unseen data and how many training samples suffice to find the solution. The third key question concerns computational efficiency, which characterizes how efficiently we can make use of the training data to choose an accurate hypothesis. The basic model for analyzing learning algorithms in computational learning theory is the Probably Approximately Correct (PAC) model proposed by the pioneering work of Valiant [144].
It applies to learning binary valued functions and uses the 0-1 loss under the realizability assumption, i.e., the algorithm receives samples that are consistent with a hypothesis in a fixed class H: ∃ h* ∈ H such that P_{(x,y)∼D}[h*(x) = y] = 1. In the PAC model we bound the loss of the algorithm with high probability over the random draw of samples. A decision-theoretic extension of the PAC framework, known as agnostic learning, was introduced by [87]; it generalizes the PAC model to general loss functions and drops the realizability assumption, as defined below:

Definition 2.1 (Agnostic PAC Learnability). A hypothesis class H is agnostic PAC learnable with respect to Ξ and a loss function ℓ : H × Ξ → R+ if there exist a function m_H : (0, 1)² → N and a learning algorithm A with the following property: for every ϵ, δ ∈ (0, 1) and for any distribution D over the domain Ξ, when running the algorithm A on m ≥ m_H(ϵ, δ) i.i.d. examples generated by D, the algorithm returns h ∈ H such that, with probability at least 1 − δ,

L_D(h) ≤ min_{h′∈H} L_D(h′) + ϵ.

If, furthermore, A runs in time poly(1/ϵ, 1/δ, n), then H is said to be efficiently agnostic PAC-learnable.

The goal of the PAC framework is to understand how large a data set needs to be in order to give good generalization. It also provides bounds for the computational cost of learning. In agnostic PAC learnability there are two fundamental questions that need to be addressed carefully: computational efficiency and sample complexity. The computational aspect of learning measures the amount of computation required to implement a learning algorithm. The sample complexity of an algorithm is the number of examples that suffices to ensure that, with probability at least 1 − δ (with respect to the random choice of S), the algorithm picks a hypothesis whose error is at most ϵ away from the optimal one. We note that while computational complexity concerns the efficiency of learning, the sample complexity is a statistical measure and concerns the difficulty of learning from the hypothesis class H with respect to the underlying distribution D. An equivalent way to present the sample complexity is to give a generalization bound. It states that, with probability at least 1 − δ, the gap between the attained risk L_D(h) and the optimal risk min_{h′∈H} L_D(h′) is upper bounded by some quantity that depends on the sample size n and on δ. The sample complexity of passive learning is well established and goes back to early works in learning theory, where the lower bounds Ω((1/ϵ)(log(1/ϵ) + log(1/δ))) and Ω((1/ϵ²)(log(1/ϵ) + log(1/δ))) were obtained in the classic PAC and the general agnostic PAC settings, respectively [52, 26, 7]. It is worth emphasizing that the PAC framework is a distribution-free model, and we are interested in sample-complexity guarantees that hold regardless of the distribution D from which the examples are drawn.

2.1.2 Empirical Risk Minimization

Since in the probabilistic setting we assume that there is an underlying probability distribution D over the sample space X × Y which captures the relationship between the samples given to the algorithm during training and the new instances it will receive in the future, the training examples must be 'representative' in some way of the examples to be seen in the future. Clearly, learning is hopeless if there is no correlation between past and present rounds.
Note, however, that the distribution D is not known to the learner; the learner sees the distribution only through the training examples S = (z_1, z_2, · · · , z_n) ∈ Ξ^n, and based on these examples must learn to predict well on new instances from the same distribution. Therefore, we cannot compute the generalization error directly, and machine learning aims to find estimators based on the observed data sample S. A simple and well-known learning approach is the empirical risk minimization (ERM) method. Basically, the idea of ERM is to replace the unknown true risk L_D(h) = P_{(x,y)∼D}[h(x) ≠ y] by its empirical counterpart rooted in the training set S and to minimize this empirical risk, defined as:

L_S(h) = (1/n) |{i : i ∈ [n] and h(x_i) ≠ y_i}|.

The empirical error L_S(h) of a hypothesis h ∈ H is its average error over the training samples in S, while the generalization error L_D(h) is its expected error on a random sample drawn from the distribution D. We note that the empirical error is a useful quantity, since it can easily be determined from the training data and provides a simple estimate of the true error. Minimizing the empirical loss over the training data yields a solution whose loss is close to the optimal loss if the class H is sufficiently large, so that the loss of the best function in H is close to the optimal loss, yet small enough so that finding the best candidate in H based on the data is computationally feasible. In this regard, generalization error bounds give an upper bound on the difference between the true and empirical error of functions in a given class, which holds with high probability with respect to the sampling of the training set. Then, our task becomes to evaluate the expected risk relying on the empirical error. Having this quantity bounded, a learning algorithm may choose the hypothesis that is the most accurate on the sample and be guaranteed that its loss on the distribution will also be low. Statistical learning theory is concerned with characterizing learnability and providing bounds on the deviation of this estimate from the expected error. One of its main achievements is a complete characterization of the necessary and sufficient conditions for generalization of ERM, and for its consistency. A fundamental answer, formally proven for supervised classification and regression, is that learnability is equivalent to uniform convergence, and that if a problem is learnable, it is learnable via empirical risk minimization.

Definition 2.2 (Uniform Convergence). A hypothesis class H has the uniform convergence property with respect to Ξ and the loss function ℓ : H × Ξ → R+ if there exists a function m_H : (0, 1)² → N such that, for any probability distribution D over Ξ and any sample S of size m_H(ϵ, δ) drawn i.i.d. from D, with probability at least 1 − δ, for all h ∈ H it holds that

|L_S(h) − L_D(h)| ≤ ϵ.

Hence, the crucial step towards proving learnability is to obtain a result on the uniform convergence of sample errors to true errors. Uniform convergence of empirical quantities to their means provides ways to bound the gap between the expected risk and the empirical risk by the complexity of the hypothesis set. Hence, the complexity of the hypothesis class H is the critical factor in determining the distribution-free sample complexity of a supervised learning problem.
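Before turning to complexity measures, the following is a small numerical illustration of the ERM rule just defined: from a finite pool of candidate linear classifiers, it selects the one with the smallest empirical 0-1 error on a training sample. The synthetic data and the candidate pool are placeholders and carry no specific meaning for the analysis in this thesis.

import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.normal(size=(n, d))                              # instances x_i
w_true = rng.normal(size=d)
y = np.sign(X @ w_true)                                  # labels y_i in {-1, +1}

candidates = [rng.normal(size=d) for _ in range(50)]     # a finite hypothesis class

def empirical_risk(w, X, y):
    """Empirical 0-1 risk L_S(h) = (1/n) |{i : sign(<w, x_i>) != y_i}|."""
    return np.mean(np.sign(X @ w) != y)

h_erm = min(candidates, key=lambda w: empirical_risk(w, X, y))
print("empirical risk of the ERM hypothesis:", empirical_risk(h_erm, X, y))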
Several complexity measures for hypothesis classes have been proposed, each providing a different type of guarantee, including the Vapnik-Chervonenkis (VC) dimension [146] and the Rademacher complexity [15, 91]. The main virtue of the Vapnik-Chervonenkis theorem and of Rademacher complexity is that they convert the problem of uniform deviations of empirical averages into a combinatorial and a data-dependent problem, respectively. We note that uniform convergence arguments are not the only possible way to characterize learnability. Since the first results of Vapnik and Chervonenkis on uniform laws of large numbers for classes of binary valued functions, there has been a considerable amount of work aiming at obtaining generalizations and refinements of these bounds. These techniques include sample compression [55], algorithmic stability [32], and PAC-Bayesian analysis [112], which have also been shown to characterize learnability and to prove generalization bounds. We will also discuss the stochastic optimization machinery [133] for characterizing learnability in general settings later in this chapter.

2.1.3 Surrogate Loss Functions and Statistical Consistency

Although the ERM approach has a lot of theoretical merit, since we seek to minimize the training error based on the 0-1 loss, it is typically a combinatorial problem, leading to an NP-hard optimization problem that is not computationally tractable. A common practice to circumvent this difficulty is to minimize a surrogate loss function, i.e., to replace the indicator function by a surrogate function and find the minimizer with respect to this surrogate function. Obviously, the surrogate loss needs to be computationally easy to optimize while remaining close, in some sense, to the 0-1 loss. In particular, if the surrogate function is assumed to be convex, the optimization can be performed efficiently with only modest computational resources. Examples of such surrogate loss functions for the 0-1 loss include the logistic loss ℓ_log(h, (x, y)) = log(1 + exp(−y h(x))) in logistic regression [61], the hinge loss ℓ_hinge(h, (x, y)) = max(0, 1 − y h(x)) in support vector machines (SVMs) [43], and the exponential loss ℓ_exp(h, (x, y)) = exp(−y h(x)) in boosting (e.g., AdaBoost [58]). When the hypothesis class H consists of functions that are linear in a parameter vector w, i.e., linear classifiers, these loss functions are depicted in Figure 2.1.

Figure 2.1: Illustrations of the 0-1 loss function and three surrogate convex loss functions: the hinge loss, the logistic loss, and the exponential loss, as scalar functions of y⟨w, x⟩.

Having defined the surrogate loss functions, the task is then to minimize the relaxed empirical loss in terms of the surrogate losses. In practice, however, the ubiquitous approach is regularized empirical risk minimization, which adds a regularization function R(h) to the objective and solves

h_S ∈ arg min_{h∈H} { L_S(h) + R(h) ≡ (1/n) ∑_{i=1}^{n} ℓ(h, z_i) + R(h) }.    (2.1)

The goal of introducing the regularizer is to prevent over-fitting. Of course, given some training data, it is always possible to build a function that fits the data exactly. But, in the presence of noise, this may lead to poor performance on unseen instances. An immediate consequence of adding the regularization term is to favor simpler classifiers and thereby increase generalization capability. Some of the commonly used regularizers in the literature are R(h) = ∥h∥²₂ and R(h) = ∥h∥₁.
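For concreteness, the sketch below evaluates the regularized empirical objective (2.1) for a linear classifier h_w(x) = ⟨w, x⟩ under the three surrogate losses above, with an ℓ2 regularizer; the synthetic data and the regularization weight are illustrative placeholders rather than recommended settings.

import numpy as np

def surrogate_losses(margin):
    """Surrogate losses as scalar functions of the margin y<w, x> (cf. Figure 2.1)."""
    return {
        "hinge":       np.maximum(0.0, 1.0 - margin),
        "logistic":    np.log1p(np.exp(-margin)),
        "exponential": np.exp(-margin),
    }

def regularized_empirical_risk(w, X, y, loss="hinge", reg=0.1):
    margins = y * (X @ w)                       # y_i <w, x_i> for every sample
    emp = np.mean(surrogate_losses(margins)[loss])
    return emp + reg * np.dot(w, w)             # L_S(w) + R(w) with R(w) = reg * ||w||^2

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = np.sign(rng.normal(size=100))
w = rng.normal(size=3)
print({name: regularized_empirical_risk(w, X, y, name) for name in ("hinge", "logistic", "exponential")})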
We note that the optimization problem in (2.1) is a convex optimization problem for which efficient algorithms exist to find a near-optimal solution in a reasonable amount of time. Although the idea of replacing the non-convex 0-1 loss function with convex surrogate loss functions seems appealing and resolves the efficiency issue of the ERM method, it has statistical consequences that must be balanced against the computational virtues of convexity. The question then is how well minimizing such a convex surrogate performs relative to minimizing the actual classification error. Statistical consistency concerns this issue. Consistency requires convergence of the empirical risk to the expected risk for the minimizer of the empirical risk, together with convergence of the expected risk to the minimum risk achievable by functions in H [14]. An important line of research in statistical learning theory has focused on relating the convex excess risk to the binary excess risk. It is known that, under mild conditions, the classifier learned by minimizing the empirical loss of a convex surrogate is consistent with the Bayes classifier [156, 101, 80, 97, 141, 14]. For instance, it was shown in [14] that the necessary and sufficient condition for a convex loss ℓ(·) to be consistent with the binary loss is that ℓ(·) is differentiable at the origin and ℓ′(0) < 0. It was further established in the same work that the binary excess risk can be upper bounded by the convex excess risk through a ψ-transform that depends on the surrogate convex loss ℓ(·). A detailed elaboration of this issue will be given in Chapter 4, where we examine the statistical consistency of smooth convex surrogates.

2.1.4 Convex Learning Problems

We now turn our attention to convex learning problems, where Ξ is an arbitrary measurable set, H is a closed, convex subset of a vector space, and the loss function ℓ(h, z) is convex with respect to its first argument. This family of learning problems encompasses a rich body of existing learning methods for which efficient algorithms exist, such as support vector machines, boosting, and logistic regression. Convex learning problems form an important family of learning problems, mainly because most of what we can learn efficiently falls into this family. Before diving into the formal definition, we need to familiarize ourselves with the following definitions from convex analysis [28, 121], which will come in handy throughout this dissertation (for standard definitions about convex analysis see the Appendix).

Definition 2.3 (Convexity). A set W in a vector space is convex if for any two vectors w, w′ ∈ W, the line segment connecting the two points is contained in W as well. In other words, for any λ ∈ [0, 1], we have λw + (1 − λ)w′ ∈ W. A function f : W → R is said to be convex if W is convex and for every w, w′ ∈ W and λ ∈ [0, 1],

f(λw + (1 − λ)w′) ≤ λ f(w) + (1 − λ) f(w′).

A continuously differentiable function is convex if f(w) ≥ f(w′) + ⟨∇f(w′), w − w′⟩ for all w, w′ ∈ W. If f is non-smooth, then this inequality holds with any sub-gradient g ∈ ∂f(w′) in place of ∇f(w′).

The formal definition of convex learning problems is given below.

Definition 2.4 (Convex Learning Problem). A learning problem with hypothesis space H, instance space Ξ = X × Y, and loss function ℓ : H × Ξ → R+ is said to be convex if the hypothesis class H is a parametrized convex set H = {h_w : x → ⟨w, x⟩ : w ∈ R^d, ∥w∥ ≤ R} and, for all z = (x, y) ∈ Ξ, the loss function ℓ(·, z) is a non-negative convex function.
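As a quick numerical illustration of Definitions 2.3 and 2.4, the sketch below checks the convexity inequality for the hinge loss ℓ(w, (x, y)) = max(0, 1 − y⟨w, x⟩), viewed as a function of the parameter w, at randomly drawn pairs of points; the instance (x, y) and the sampled points are placeholders.

import numpy as np

rng = np.random.default_rng(5)
d = 4
x, y = rng.normal(size=d), 1.0

def hinge(w):
    """Hinge loss of the linear hypothesis h_w on the fixed instance (x, y)."""
    return max(0.0, 1.0 - y * float(w @ x))

violations = 0
for _ in range(10_000):
    w1, w2 = rng.normal(size=d), rng.normal(size=d)
    lam = rng.uniform()
    lhs = hinge(lam * w1 + (1 - lam) * w2)
    rhs = lam * hinge(w1) + (1 - lam) * hinge(w2)
    if lhs > rhs + 1e-12:              # convexity requires lhs <= rhs
        violations += 1
print("violations of the convexity inequality:", violations)   # expected: 0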
In the remainder of the thesis, when it is clear from the context, we will represent the hypothesis class by W and simply use the vector w to represent h_w, rather than working with the hypothesis h_w. We note that for convex learning problems, the ERM rule becomes a convex optimization problem which can be solved efficiently. This stands in sharp contrast to non-convex loss functions, such as the 0-1 loss, for which solving the ERM rule is computationally cumbersome and known to be NP-hard. Obviously, this efficiency comes at a price: not every convex learning problem is guaranteed to be learnable, and convexity by itself is not sufficient for learnability. This requires imposing more assumptions on the setting to ensure learnability of the problem. In particular, it can be shown that if the hypothesis space W is bounded and the loss function is Lipschitz or smooth, as formally defined below, then the convex learning problem is learnable [133, 130].

Definition 2.5 (Lipschitzness). A function f : W → R is ρ-Lipschitz over the set W if for every w, w′ ∈ W we have |f(w) − f(w′)| ≤ ρ∥w − w′∥.

Definition 2.6 (Smoothness). A differentiable loss function f : W → R is said to be β-smooth with respect to a norm ∥ · ∥ if it holds that

f(w) ≤ f(w′) + ⟨∇f(w′), w − w′⟩ + (β/2)∥w − w′∥²,  ∀ w, w′ ∈ W.    (2.2)

We note that smoothness also follows if the gradient of the loss function is β-Lipschitz, i.e., ∥∇f(w) − ∇f(w′)∥ ≤ β∥w − w′∥. Smooth functions arise, for instance, in logistic and least-squares regression, and in general when learning linear predictors where the loss function has a Lipschitz continuous gradient. There has been an upsurge of interest over the last decade in finding tight upper bounds on the sample complexity of convex learning problems by utilizing prior knowledge on the curvature of the loss function, which has led to stronger generalization bounds in the agnostic PAC setting. In [95] fast rates were obtained for the squared loss by exploiting the strong convexity of this loss function, which only holds under a pseudo-dimensionality assumption. With the recent developments in online strongly convex optimization [68], fast rates approaching O((1/ϵ) log(1/δ)) for convex Lipschitz strongly convex loss functions have been obtained in [83, 140, 82], and for exponentially concave loss functions in [103]. For smooth non-negative loss functions, [138] improved the sample complexity to optimistic rates for non-parametric learning using the notion of local Rademacher complexity [13].

2.2 Sequential Prediction/Online Learning

The statistical model discussed above first assumes the existence of a stochastic model generating instances according to the underlying distribution D, then samples a training set S = ((x_1, y_1), (x_2, y_2), · · · , (x_n, y_n)) and investigates the ERM strategy to find a hypothesis h ∈ H that generalizes well to unseen instances. Although this model is valid for cases in which a tractable statistical model reasonably describes the underlying process, it may be unrealistic in practical problems where the process is hard to model from a statistical viewpoint and may even react to the learner's decisions, e.g., applications such as portfolio management, computational finance, and weather prediction.
Sequential prediction/online learning (also known as universal prediction of individual sequences) is a strand of learning theory that avoids making any stochastic assumptions about the way the observations are generated; the goal is to develop prediction methods that are robust in the sense that they work well even in the worst case. For instance, in the problem of online portfolio management [18], an online investor wants to distribute her wealth over a set of available financial instruments without any assumption on the market outcome in advance. Obviously, in applications of this kind, the main challenge is that the learner cannot make any statistical assumption about the process generating the instances, and the data are continuously evolving or adversarially changing. Online learning or sequential prediction is an elegant paradigm for capturing these problems that alleviates the statistical assumptions usually made in the statistical setting (see, e.g., [38] and [129] for a thorough discussion).

2.2.1 Mistake Bound Model and Regret Analysis

The problem of sequential prediction may be cast as a repeated game between a decision maker, also called the forecaster, and an environment, also called the adversary. In this model, learning proceeds in T consecutive rounds, as we see examples one by one. At the beginning of round t, the learning algorithm A holds a hypothesis h_t ∈ H and the adversary picks an instance z_t = (x_t, y_t). The adversary at round t can select the instance z_t ∈ Ξ in an adversarial, worst-case fashion based on the previous instances z_1, . . . , z_{t−1} and on the previous hypotheses h_1, . . . , h_{t−1} selected by the learner. Then, the learner receives the instance x_t and predicts h_t(x_t). At the end of the round, the true label y_t is revealed to the learner, and A makes a mistake if h_t(x_t) ≠ y_t. Unlike the statistical setting, here the prediction task is sequential: the outcomes are revealed only one after another; at time t, the learner guesses the next outcome y_t before it is revealed. The algorithm then updates its hypothesis, if necessary, to h_{t+1}, and this continues until time T. The sequential prediction model discussed above, which is reminiscent of the framework of competitive analysis [27], is known as the mistake bound model and was introduced to the learning community in [98]. In this model the goal of the learner is to sequentially deduce information from previous rounds so as to improve its predictions on future rounds. An algorithm that is conservative (or lazy), meaning that it only changes its hypothesis when it makes a mistake, is called mistake driven. We note that many seemingly unrelated problems fit in the framework of this abstract sequential decision problem, including online prediction in the experts model [37], Perceptron-like classification algorithms [25], the Winnow algorithm [98], and learning in repeated game playing [59], to name a few. Here we list a few sequential prediction problems to better illustrate the setting.

Example 1: Online classification and regression. As an illustrative example, let us consider the online binary classification problem where at each round t the learner receives an instance x_t ∈ X as input and is required to predict a label ŷ_t ∈ Y = {−1, +1} for the input instance. Then, the learner receives the correct label y_t ∈ Y and suffers the 0-1 loss I[y_t ≠ ŷ_t], as sketched in the protocol below.
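The following is a minimal sketch of this protocol for a learner that maintains a linear hypothesis and performs a Perceptron-style update on mistakes; the update rule and the synthetic data stream are illustrative placeholders and not part of the analysis in this thesis.

import numpy as np

rng = np.random.default_rng(2)
d, T = 5, 1000
w_target = rng.normal(size=d)              # hidden rule used here to generate labels

w = np.zeros(d)                             # learner's current hypothesis h_t
mistakes = 0
for t in range(T):
    x_t = rng.normal(size=d)                # instance x_t is revealed
    y_hat = 1.0 if x_t @ w >= 0 else -1.0   # learner predicts h_t(x_t)
    y_t = 1.0 if x_t @ w_target >= 0 else -1.0   # true label y_t is revealed
    if y_hat != y_t:                        # 0-1 loss I[h_t(x_t) != y_t]
        mistakes += 1
        w = w + y_t * x_t                   # conservative (mistake-driven) update
print("number of mistakes over", T, "rounds:", mistakes)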
As another example, let us consider online regression problems where at each round t a feature vector x_t ∈ R^d is given to the online learner, and a value y_t ∈ R has to be estimated using linear predictors with bounded norm, i.e., H = {x → ⟨w, x⟩ : w ∈ R^d, ∥w∥ ≤ R}, so that the learner predicts ⟨w, x_t⟩. The loss function at round t for a predictor w is ℓ_t(w) = (y_t − ⟨w, x_t⟩)².

Example 2: Prediction with expert advice. Every day the manager of a company should decide to produce one of K different products without knowing the market demand in advance. At the end of the day, he is informed of the gain achieved by selling the chosen product but learns nothing about the potential income from the other products. The goal of the manager is to maximize the income of the company over a sequence of many periods. This is a problem of repeated decision making, called learning from expert advice, where the objective functions to be optimized are unknown and revealed (perhaps only partially) in an online manner. In the general prediction with expert advice game, a learner competes against K ∈ N experts in a game consisting of T rounds. In each round t, each expert reveals a prediction from Y = {0, 1}. The learner forms its own prediction by sampling an expert from h_t = w_t ∈ H ≡ ∆_K, where ∆_K is the set of probability distributions over the K experts (i.e., the simplex). The true outcome y_t is then revealed, and the learner and all of the experts receive a penalty depending on how well their predictions fit the revealed outcome. The aim of the learner in this game is to incur a cumulative loss over all rounds that is not much worse than that of the best expert.

One natural measure of the quality of learning in the mistake bound model of the sequential setting is the number of worst-case mistakes the learner makes. In particular, under the realizability assumption (i.e., where there exists a hypothesis in H which performs perfectly on the sequence), the learner's goal becomes to make a bounded number of mistakes, which is known as a mistake bound. The optimal mistake bound for a hypothesis class H is the minimum mistake bound over all learning algorithms A on the worst-case sequence of examples:

Mistake(A, H, Ξ, T) = min_{h_1,h_2,··· ,h_T ∈ H} max_{x_1,x_2,··· ,x_T ∈ X} ∑_{t=1}^{T} I[h_t(x_t) ≠ h(x_t)],    (2.3)

where h_1, h_2, · · · , h_T ∈ H is the sequence of hypotheses generated by A.

We say that a hypothesis class H is learnable in the online learning model if there exists an online learning algorithm A with a finite worst-case mistake bound, no matter how long the sequence of examples T is. We note that the mistake bound model, in comparison to the PAC model, is strong in the sense that it does not depend on any assumption about the instances. It is also remarkable that despite the inherent differences between the PAC and mistake bound frameworks, mistake bounds have corresponding risk bounds that are not worse, and sometimes better, than those obtainable with a direct statistical approach. In particular, by a simple reduction, it is straightforward to show that if an algorithm A learns a hypothesis class H in the mistake bound model, then A also learns H in the probably approximately correct model. We note that, due to an impossibility result by Cover [45], no online predictor that makes deterministic predictions can achieve a sub-linear regret universally for all sequences. To circumvent this obstacle, two typical solutions have been examined: randomization and convexification.
In the former, we allow the learner to make randomized predictions, making the algorithm unpredictable to the adversary; in the latter, we replace the non-convex 0-1 loss with a convex surrogate loss function (see, e.g., [24] and [129]). Similar to statistical learning, we can also generalize the online setting to the agnostic (a.k.a. non-realizable) setting, where there is no classifier in H which performs perfectly on the sequence. In this case an adversary can make the cumulative loss of our online learning algorithm arbitrarily large. To overcome this deficiency, the performance of the forecaster is compared to some notion of "how well it could have performed". In particular, the performance of the online learner is compared to that of the best single decision for the sequence, chosen in hindsight from the hypotheses in H. This brings us to the objective commonly known as regret, which is formally defined by:

Regret(A, H, Ξ, T) = ∑_{t=1}^{T} ℓ(h_t, (x_t, y_t)) − min_{h∈H} ∑_{t=1}^{T} ℓ(h, (x_t, y_t)),    (2.4)

where ℓ(h, (x, y)) is the loss function measuring the discrepancy between the prediction h(x) and the corresponding observed element, e.g., the 0-1 loss I[h(x) ≠ y] for binary classification. Regret measures the difference between the cumulative loss of the learner's strategy and the minimum possible loss had the sequence of loss functions been known in advance and the learner been able to choose the best fixed action in hindsight. In particular, we are interested in the rate of growth of Regret(A, H, Ξ, T) in terms of T. When this is sub-linear in the number of rounds, that is, o(T), we call the solution Hannan consistent [38], implying that the learner's average per-round loss approaches the average per-round loss of the best fixed action in hindsight. Notably, the performance bound must hold for any sequence of loss functions, in particular when the sequence is chosen adversarially.

Online learnability. In the online setting, the analogue of PAC learnability was addressed by Littlestone [98], who described a combinatorial characterization of the hypothesis classes that are learnable in the mistake bound model under the realizability assumption. The extension of these results to the agnostic online setting was addressed in [20]. Recall that, in the PAC model, the VC-dimension of a hypothesis class H characterizes the learnability of H if we ignore computational considerations. Moreover, the VC-dimension characterizes learnability in the agnostic PAC model as well. In the online setting, what is known as the Littlestone dimension plays the same role. Recently, the notion of Sequential Rademacher complexity has been introduced to characterize online learnability, playing a role similar to that of the Rademacher complexity in statistical learning theory [126].

2.2.2 Online Convex Optimization and Regret Bounds

The online convex optimization (OCO) framework generalizes many known online learning problems in the realm of sequential prediction and repeated game playing. Among these are online classification and sequential portfolio optimization. The unified setting of OCO was introduced in [63], and the exact term was used earlier in [158]. Since the introduction of OCO, there has been a dizzying number of extensions and variants, which are the focus of this section. Assume we are given a fixed convex set W and some set F of convex functions on W. In OCO, a decision maker is iteratively required to choose a decision w_t ∈ W.
After making the decision w_t at round t, a convex loss function f_t ∈ F is chosen by the adversary and the decision maker incurs the loss f_t(w_t). The loss function is chosen completely arbitrarily, possibly in an adversarial manner given the current and past decisions of the decision maker. Online linear optimization is a special case of OCO in which F is the set of linear functions, i.e., F = {w → ⟨f, w⟩ : f ∈ R^d}. The goal of online convex optimization is to come up with a sequence of solutions w_1, . . . , w_T that minimizes the regret, defined as the difference between the cost of the sequence of decisions accumulated up to trial T by the learner and the cost of the best fixed decision in hindsight, i.e.,

Regret(A, W, F, T) = ∑_{t=1}^{T} f_t(w_t) − min_{w∈W} ∑_{t=1}^{T} f_t(w).

Based on the type of feedback revealed to the learner by the adversary at the end of each iteration, we distinguish two types of OCO problems. In full-information OCO, after suffering the loss, the decision maker gains full knowledge of the function f_t(·). In the partial-information setting (also called bandit OCO), the decision maker only learns the value f_t(w_t) and does not gain any other knowledge about f_t(·). We also distinguish between oblivious and adaptive adversaries. In the oblivious or non-adaptive model, the adversary is assumed to know our algorithm and can pick the worst possible sequence of cost functions for it. However, this sequence must be fixed in advance before the game starts; during the game, the adversary receives no feedback about our chosen decisions. In the more powerful adaptive model, the adversary is assumed to know not only our algorithm but also the history of the game up to the current round. In other words, at the end of each round t, our decision w_t is revealed to the adversary, and the next cost function f_{t+1}(·) may depend arbitrarily on w_1, · · · , w_t. The design of algorithms for regret minimization in the OCO setting has recently been influenced by tools from convex optimization. It has long been known that special kinds of loss functions permit tighter regret bounds than others. The two most important families of loss functions that have been considered are convex Lipschitz functions and strongly convex functions. Before presenting the known results on regret bounds for different families of loss functions, we first need the definition of strong convexity:

Definition 2.7 (Strong Convexity). A loss function f : W → R is said to be α-strongly convex with respect to a norm ∥ · ∥ if there exists a constant α > 0 (often called the modulus of strong convexity) such that, for any λ ∈ [0, 1] and for all w, w′ ∈ W, it holds that

f(λw + (1 − λ)w′) ≤ λ f(w) + (1 − λ) f(w′) − (α/2) λ(1 − λ)∥w − w′∥².

If f(w) is twice differentiable, then an equivalent definition of strong convexity is ∇²f(w) ⪰ αI, which indicates that the smallest eigenvalue of the Hessian of f(w) is uniformly lower bounded by α everywhere. When f(w) is differentiable, strong convexity is equivalent to

f(w) ≥ f(w′) + ⟨∇f(w′), w − w′⟩ + (α/2)∥w − w′∥²,  ∀ w, w′ ∈ W.

Henceforth, we shall review a few algorithms for OCO and state their regret bounds under different assumptions on the curvature of the sequence of adversarial loss functions.

2.2.2.1 Online Gradient Descent

We start with the first and perhaps simplest online convex optimization algorithm, which nevertheless captures the spirit of the idea behind most of the existing methods.
This algorithm applies to the most general setting of online convex optimization and is referred to as Online Gradient Descent (OGD). The OGD method is rooted in the standard gradient descent algorithm and was introduced to the online setting by Zinkevich [158]. The OGD method starts with an arbitrary decision w_1 ∈ W and iteratively modifies it according to the cost functions encountered so far, as follows:

Online Gradient Descent (OGD)
Input: convex set W, step size η > 0
Initialize: w_1 ∈ W
for t = 1, 2, . . . , T
    Play w_{t+1} = Π_W(w_t − η∇f_t(w_t))
    Receive loss function f_{t+1}(·) and incur f_{t+1}(w_{t+1})
end for

Here Π_W(·) denotes the orthogonal projection onto the convex set W. The OGD algorithm is straightforward to implement, and updates take time O(d) given the gradient. However, the projection step might be computationally cumbersome for complex domains W. The following theorem states that the regret bound of the OGD method, when applied to convex Lipschitz functions, is of order O(√T), which has been proven to be tight up to constant factors [1].

Theorem 2.8. Let f_1, f_2, . . . , f_T be an arbitrary sequence of convex, differentiable functions defined over the convex set W ⊆ R^d. Let G be an upper bound on the gradient norms, i.e., ∥∇f_t(w)∥ ≤ G for all t ∈ [T] and w ∈ W, and let R = max_{w,w′∈W} ∥w − w′∥. Then OGD with step size η_t = R/(G√t) achieves the following guarantee, for any w ∈ W and all T ≥ 1:

∑_{t=1}^{T} f_t(w_t) − ∑_{t=1}^{T} f_t(w) ≤ ∥w − w_1∥²/(2η) + (η/2) ∑_{t=1}^{T} ∥∇f_t(w_t)∥² = O(RG√T).    (2.5)

In fact it is possible to show that OGD can attain a logarithmic regret O(log T) for strongly convex functions by appropriately tuning the step sizes, as stated below.

Theorem 2.9. Let f_1, f_2, . . . , f_T : W → R be an arbitrary sequence of α-strongly convex functions. Under the same conditions as Theorem 2.8, with step size η_t = 1/(αt), the OGD algorithm achieves the following regret:

∑_{t=1}^{T} f_t(w_t) − ∑_{t=1}^{T} f_t(w) ≤ (G²/2) ∑_{t=1}^{T} 1/(αt) ≤ (G²/(2α))(1 + log T) = O(log T).    (2.6)

Remark 2.10. In the above algorithm, the updating rule of OGD uses the gradients ∇f_t(w) of the loss functions at each iteration to update the solution. In fact it is not required to assume that the loss functions are differentiable; it suffices to assume that the loss functions have a sub-gradient everywhere in the domain W, making the algorithm suitable for non-smooth settings. In particular, for any g_t ∈ ∂f_t(w_t), where ∂f_t(w_t) is the set of sub-gradients at the point w_t, the algorithm achieves the same regret bounds in both cases.

2.2.2.2 Follow The Perturbed Leader

The first efficient algorithm for general online linear optimization problems is due to Hannan [65], and it was subsequently rediscovered and clarified in [86]. The Follow The Perturbed Leader (FTPL) algorithm assumes that there is an oracle that can efficiently solve the offline optimization problem. Having access to such an oracle, FTPL selects the decision that appears to be the best so far, but for a version of the actual cost vectors that has been perturbed by the addition of some noise. The addition of random noise to the observed cost functions has the effect of slowing down the algorithm: instead of tracking small fluctuations in the cost functions, it tends to stick to the same decision unless there is a compelling reason to switch to another decision. Although the FTPL algorithm works in a setting similar to OGD, there is a crucial difference between them which makes FTPL more suitable for online combinatorial learning.
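Returning briefly to OGD before describing this difference, the following is a minimal sketch of the OGD update above for the Euclidean-ball domain, where the projection Π_W reduces to a simple rescaling; the quadratic losses, the slowly varying targets, and the step size schedule are illustrative placeholders.

import numpy as np

def project_ball(w, radius):
    """Orthogonal projection onto the Euclidean ball of the given radius."""
    norm = np.linalg.norm(w)
    return w if norm <= radius else w * (radius / norm)

rng = np.random.default_rng(3)
d, T, radius = 5, 500, 1.0
w = np.zeros(d)                                  # w_1
losses = []
for t in range(1, T + 1):
    x_t = rng.normal(size=d)                     # adversary's round-t data (placeholder)
    y_t = np.sin(t / 50.0)                       # slowly varying target (placeholder)
    losses.append((w @ x_t - y_t) ** 2)          # incur f_t(w_t)
    grad = 2.0 * (w @ x_t - y_t) * x_t           # gradient of f_t at w_t
    eta_t = 1.0 / np.sqrt(t)                     # step size of order 1/sqrt(t)
    w = project_ball(w - eta_t * grad, radius)   # w_{t+1} = Pi_W(w_t - eta_t * grad)
print("average per-round loss:", np.mean(losses))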
In particular, the decision set W does not need to be convex as long as the offline optimization problem can be solved efficiently. This is a significant advantage of the FTPL approach, which can be utilized to tackle more general problems with discrete decision spaces. The FTPL method for online decision making relies on a linear optimization procedure M over the set W that computes M(f) = arg min_{w∈W} ⟨f, w⟩ for any f ∈ R^d. FTPL then chooses w_{t+1} by first drawing a perturbation µ_t ∈ [0, 1/η]^d uniformly at random and computing w_{t+1} = M(∑_{τ=1}^{t} f_τ + µ_t).

Follow The Perturbed Leader (FTPL)
Input: a general domain W, step size η > 0, offline linear oracle M over W
Initialize: w_1 = M(0)
for t = 1, 2, . . . , T
    Draw µ_t ∈ [0, 1/η]^d uniformly at random
    Play w_{t+1} = M(∑_{τ=1}^{t} f_τ + µ_t)
    Receive loss vector f_{t+1} ∈ R^d and incur ⟨f_{t+1}, w_{t+1}⟩
end for

The regret of the FTPL algorithm is stated in the following theorem.

Theorem 2.11 (Regret Bound for FTPL). Let f_1, . . . , f_T be an arbitrary sequence of linear loss functions from the unit ball and let w_1, . . . , w_T be the sequence of decisions generated by FTPL constrained to a general set W. Then for any w ∈ W, the FTPL algorithm with parameter η = √(R/(GT)) satisfies

E[∑_{t=1}^{T} ⟨f_t, w_t⟩ − ∑_{t=1}^{T} ⟨f_t, w⟩] ≤ 2√(RGT),

where G, with max_{w} |⟨f_t, w⟩| ≤ G for all t ∈ [T], is an upper bound on the magnitude of the rewards, and R is an upper bound on the ℓ1 diameter of W, i.e., R ≥ max_{w,w′∈W} ∥w − w′∥₁.

2.2.2.3 Follow The Regularized Leader

A natural modification of the basic FTPL algorithm, or of fictitious play in game theory, is the Follow The Regularized Leader (FTRL) algorithm, in which we minimize the loss on all past rounds plus a regularization term. Regularization is an alternative to perturbation for stabilizing the decisions during prediction. Naturally, different regularization functions yield different algorithms for different applications. In FTRL we assume that the loss functions are linear, f_t(w) = ⟨f_t, w⟩ with f_t ∈ R^d; the generalization to convex loss functions can be accomplished by a linearization strategy. This reduction is shown in Figure 2.2.

Figure 2.2: Reduction of the general online convex optimization problem to online optimization with linear functions.

The main idea behind the linearization trick is that if an algorithm A is able to achieve good regret against linear loss functions, then A can be used to achieve good regret against sequences of convex loss functions as well. To see this, note that from the definition of convexity, i.e., f_t(w) ≥ f_t(w_t) + ⟨∇f_t(w_t), w − w_t⟩, we can feed the learner A with the linear loss functions f_t′(w) = ⟨∇f_t(w_t), w⟩ and then guarantee that f_t(w_t) − f_t(w) ≤ ⟨∇f_t(w_t), w_t − w⟩. Therefore, any bound on the regret for linear loss functions translates directly into a regret bound for convex losses. At each round t, the FTRL algorithm solves an offline optimization problem based on the sum of the loss functions up to time t − 1 and a regularization function. Let R(w) be a strongly convex differentiable function. The detailed steps of FTRL are as follows [3]:

Follow the Regularized Leader (FTRL)
Input: convex set W, step size η > 0, regularization R(w)
Initialize: w_1 ∈ arg min_{w∈W} R(w)
for t = 1, 2, . . . , T
    Play w_{t+1} = arg min_{w∈W} [ η ∑_{s=1}^{t} ⟨f_s, w⟩ + R(w) ]
    Receive loss function f_{t+1}(·) and incur f_{t+1}(w_{t+1})
    Set f_{t+1} = ∇f_{t+1}(w_{t+1})
end for

It is noticeable that what FTRL implements at each iteration is the regularized empirical risk minimization over the previous trials, as we saw in statistical learning. In the online setting, the regularizer has the role of forcing consecutive solutions to stay close to each other, a role similar to the one the perturbation plays in the FTPL algorithm. Furthermore, different choices of the regularizer lead to different algorithms. For instance, in the simplest case, if we let R(w) = (1/2)∥w∥², then FTRL behaves the same as the OGD algorithm. It is not hard to prove a simple bound on the regret of the FTRL algorithm for a given strongly convex regularization function R(w) and learning rate η.

Theorem 2.12 (Regret Bound for FTRL). Let f_1, f_2, . . . , f_T ∈ R^d be an arbitrary sequence of linear loss functions over the convex set W ⊆ R^d. Let R : W → R be a 1-strongly convex function with respect to a norm ∥ · ∥. Let R = max_{w∈W} R(w) − R(w_1) and max_{t∈[T]} ∥f_t∥_* ≤ G. Then, for any w ∈ W, by setting η = √R/(G√T), the regret of FTRL is bounded by:

∑_{t=1}^{T} ⟨f_t, w_t⟩ − ∑_{t=1}^{T} ⟨f_t, w⟩ ≤ (R(w) − R(w_1))/η + η ∑_{t=1}^{T} ∥f_t∥²_* = O(G√(RT)).    (2.7)

2.2.2.4 Online Mirror Descent

Another algorithm for the OCO problem is the online version of the celebrated proximal point algorithm in offline convex optimization. As mentioned before, the implicit goal of the regularization used in the FTRL algorithm is to control by how much consecutive solutions differ from each other. The proximal point algorithm is designed with the explicit goal of keeping w_{t+1} as close as possible to w_t. The closeness of two solutions is measured by the Bregman divergence induced by a strongly convex Legendre function Φ(·) defined over a convex domain K. A Legendre function is a strictly convex function with continuous partial derivatives whose gradient blows up at the boundary of its domain (see the Appendix for a detailed discussion). We assume that W ∩ K ≠ ∅. The online proximal point method solves an optimization problem which expresses the trade-off between the distance from the old solution and the loss suffered on the current convex function:

w_{t+1} = arg min_{w∈W∩K} [ η_t f_t(w) + B_Φ(w, w_t) ],    (2.8)

where B_Φ(w, w_t) is the Bregman divergence induced by Φ(·) (Definition A.17). When the loss functions are linear, i.e., f_t(w) = ⟨f_t, w⟩ for some f_t ∈ R^d, or when one replaces the objective f_t(w) in (2.8) with its linearization, i.e., f_t(w) ≈ f_t(w_t) + ⟨w − w_t, ∇f_t(w_t)⟩, the proximal point method becomes the Online Mirror Descent (OMD) algorithm, as detailed below:

Online Mirror Descent (OMD)
Input: convex set W, step size η > 0, Legendre function Φ : K → R
Initialize: w_1 ∈ arg min_{w∈K} Φ(w)
for t = 1, 2, . . . , T
    Play w_{t+1} = arg min_{w∈W∩K} [ η_t ⟨w − w_t, ∇f_t(w_t)⟩ + B_Φ(w, w_t) ]
    Receive loss function f_{t+1}(·) and incur f_{t+1}(w_{t+1})
end for

Theorem 2.13 (Regret Bound for OMD). Let f_1, f_2, . . . , f_T be an arbitrary sequence of convex, differentiable functions defined over the convex set W ⊆ R^d. Let G = max_t ∥∇f_t(w_t)∥_* and R = max_{w,w′∈W} ∥w − w′∥. Let Φ : K → R be a Legendre function which is 1-strongly convex with respect to the norm ∥ · ∥, with W ∩ K ≠ ∅. Then the regret of OMD can be bounded by

∑_{t=1}^{T} f_t(w_t) − ∑_{t=1}^{T} f_t(w) ≤ B_Φ(w, w_1)/η + (η/2) ∑_{t=1}^{T} ∥∇f_t(w_t)∥²_* = O(RG√T),    (2.9)

where ∥ · ∥_* is the dual norm of ∥ · ∥.
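As a concrete instance, the sketch below is a hedged implementation of OMD on the probability simplex with the negative entropy as the Legendre function, for which the linearized update has the familiar multiplicative (exponentiated gradient) closed form; the synthetic loss vectors and the step size are illustrative placeholders.

import numpy as np

def omd_entropic(loss_vectors, eta):
    """OMD on the simplex with Phi(w) = sum_i w_i log w_i: w_{t+1,i} ∝ w_{t,i} exp(-eta g_{t,i})."""
    K = loss_vectors.shape[1]
    w = np.full(K, 1.0 / K)            # w_1 = argmin of Phi over the simplex (uniform)
    total_loss = 0.0
    for g in loss_vectors:             # g plays the role of grad f_t(w_t)
        total_loss += float(g @ w)     # incur the linear loss <g_t, w_t>
        w = w * np.exp(-eta * g)       # mirror step in the entropic geometry
        w = w / w.sum()                # Bregman projection back onto the simplex
    return total_loss, w

rng = np.random.default_rng(4)
T, K = 1000, 10
losses = rng.uniform(size=(T, K))
cum_loss, w_final = omd_entropic(losses, eta=np.sqrt(np.log(K) / T))
best_fixed = losses.sum(axis=0).min()
print("algorithm loss:", cum_loss, " best fixed expert:", best_fixed)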
We note that many classical online learning algorithms can be viewed as variants of OMD, generally either in the Euclidean geometry, such as the Perceptron algorithm and OGD, or in the simplex geometry, using an entropic distance generating function, such as Winnow [98] and the Online Exponentiated Gradient algorithm [90].

2.2.2.5 Online Newton Step

As mentioned in the analysis of the OGD algorithm, it attains a logarithmic regret O(log T) when the loss functions have bounded gradients and are strongly convex. Another case in which we can obtain logarithmic regret is that of exp-concave loss functions (i.e., functions f for which exp(−αf(w)) is concave for some α > 0). Exp-concavity is a weaker condition than bounded gradients together with strong convexity. The Online Newton Step (ONS) [68] is the adaptation of the Newton method for convex optimization to the online setting; it achieves logarithmic regret when run on exp-concave functions, which makes it more general than OGD.

Online Newton Step (ONS)
Input: convex set W, step size η > 0
Initialize: w_1 ∈ W
for t = 1, 2, . . . , T
    Play w_{t+1} = Π_W^{A_t}(w_t − η_t A_t^{−1} ∇f_t(w_t)), where A_t = ∑_{s=1}^{t} ∇f_s(w_s)∇f_s(w_s)^⊤
    Receive loss function f_{t+1}(·) and incur f_{t+1}(w_{t+1})
end for

Here Π_W^A(·) is the projection induced by a matrix A, i.e., Π_W^A(w) = arg min_{z∈W} (z − w)^⊤ A (z − w). We note that, compared to the previous algorithms, which only exploit first-order information about the loss functions, the analysis of the ONS method is based on second-order information, i.e., the second derivatives of the loss functions, whereas the implementation of ONS relies only on first-order information. The following result shows that the ONS method achieves a logarithmic regret for exp-concave functions, which is a stronger result than the performance of OGD for strongly convex losses.

Theorem 2.14 (Regret Bound for ONS). Let f_1, f_2, . . . , f_T be an arbitrary sequence of α-exp-concave functions defined over the convex set W ⊆ R^d with G = max_t ∥∇f_t(w_t)∥. Let R = max_{w,w′∈W} ∥w − w′∥. Then the ONS algorithm achieves the following regret bound for any w ∈ W:

∑_{t=1}^{T} f_t(w_t) − ∑_{t=1}^{T} f_t(w) ≤ (1/α + GR) d log T = O(d log T).    (2.10)

Although this result seems appealing, for some functions of interest, such as the logistic loss, the exp-concavity parameter α can be extremely small, making the constant 1/α in the bound exponentially large [113]. This exponential dependence on the diameter of the feasible set can make the bound worse than the O(√T) bound obtained by OGD.

2.2.3 Variational Regret Bounds

Most previous works, including those discussed above, considered the most general setting in which the loss functions could be arbitrary and possibly chosen in an adversarial way. However, the environments around us may not always be adversarial, and the loss functions may exhibit patterns which can be exploited to achieve a smaller regret. Consequently, it has been objected that requiring an algorithm to have small regret for all sequences leads to results that are too loose to be practically interesting; the bounds obtained for worst-case scenarios become pessimistic for regular sequences. Therefore, it would be desirable to develop algorithms that yield tighter bounds for more regular sequences, while still providing protection against worst-case sequences. To this end, we need to replace the number of rounds appearing in the regret bound with some other notion of performance.
In particular, this new measure should depend on the variation in the sequence of cost functions presented to the learner. Having such an algorithm guarantees that if the cost sequence has low variation, the algorithm will be able to perform better. One work along this direction is that of [69]. For the online linear optimization problem, in which each loss function is linear, f_t(w) = ⟨f_t, w⟩, and can be seen as a vector, they considered the case in which the loss functions have a small variation, defined as

Variation(A, W, F, T) = ∑_{t=1}^{T} ∥f_t − µ∥²₂,

where µ = (1/T) ∑_{t=1}^{T} f_t is the average of the loss functions. For this setting, they showed that a regret of O(√Variation) can be achieved, and they also have an analogous result for the prediction with expert advice problem. According to this definition, a small Variation(A, W, F, T) means that most of the loss functions center around some fixed loss vector µ. This models a stationary environment, in which all of the loss functions are produced according to some fixed distribution. The variation bound is defined in terms of the total difference between the individual linear cost vectors and their mean. In Chapter 5 of this thesis, we introduce another measure, called the gradual variation. Gradual variation is more general and applies to environments which may be evolving, but in a somewhat gradual way. For example, the weather conditions or the stock price at one moment may have some correlation with the next, and their difference is usually small, while abrupt changes only occur sporadically. Formally, the gradual variation of a sequence of loss functions is defined as:

GradualVariation(A, W, F, T) = ∑_{t=1}^{T−1} max_{w∈W} ∥∇f_{t+1}(w) − ∇f_t(w)∥²₂.

It is easy to verify that the gradual variation lower bounds the variation, and hence algorithms with regret bounded by the gradual variation are more adaptive to regular patterns than algorithms with variational regret bounds.

2.2.4 Bandit Online Convex Optimization

In bandit OCO, once the online learner commits to the decision w_t at round t, he does not have access to the function f_t(·) chosen by the adversary and instead only receives the scalar loss f_t(w_t) he suffers at the point w_t. In the optimization community, this problem is usually known as zeroth-order or derivative-free convex optimization, as we only have access to function values in solving the optimization problem [79, 136]. A simple approach to bandit OCO, which has been the main theme in most existing works, is to utilize a reduction to the full-information OCO setting. To do so, one needs to approximate the gradient of the loss functions at each iteration based on the observed scalar loss and feed it to the full-information algorithm. This reduction is illustrated in Figure 2.3. A simple idea for estimating the gradients was utilized in [54], where a modified gradient descent approach for bandit OCO was presented that attains an O(T^{3/4}) regret bound. The key idea of their algorithm is to compute a stochastic approximation of the gradient of the cost functions from a single point evaluation of the cost function. The main observation is that one can estimate the gradient of a function f(w) by taking a random vector u from the unit sphere S = {x ∈ R^d : ∥x∥ = 1} and scaling it by f(w + δu), i.e., ĝ = f(w + δu)u.

Figure 2.3: Reduction of bandit online convex optimization to online convex optimization with full information; the full-information algorithm plays from the shrunk domain (1 − ξ)W to ensure that the sampled points belong to the domain.
Then E[ĝ] is proportional to the gradient of a smoothed version of f(·) defined as f̂(w) = E_{v∼B}[f(w + δv)], where v is drawn uniformly at random from the unit ball B = {x ∈ R^d : ∥x∥ ≤ 1}. To ensure that the sampled query points belong to the domain W, OGD is run over the shrunk domain (1 − ξ)W, where we further assume that rB ⊆ W ⊆ RB.

Expected Online Gradient Descent (EOGD)
Input: convex set W, step size η > 0, parameters δ, r, and ξ
Initialize: z_1 = 0
for t = 1, 2, . . . , T
    Pick a unit vector u_t uniformly at random
    Play w_t = z_t + δu_t and observe f_t(w_t)
    Update z_{t+1} = Π_{(1−ξ)W}(z_t − η f_t(w_t) u_t)
end for

Theorem 2.15 (Regret Bound for EOGD). Let f_1, f_2, . . . , f_T be a sequence of convex, differentiable functions defined over the convex domain W ⊆ R^d with rB ⊆ W ⊆ RB. Let g_1, . . . , g_T be vector-valued random variables with E[g_t | w_t] = E[∇f_t(w_t)] and ∥g_t∥ ≤ G for some G > 0. Then, for η = R/(G√T), ξ = δ/r, and δ = Θ(T^{−1/4}) with a constant depending on R, d, G, and r,

E[∑_{t=1}^{T} f_t(w_t)] − min_{w∈W} ∑_{t=1}^{T} f_t(w) ≤ O(T^{3/4}).

This regret bound was later improved to O(T^{2/3}) [9] for online bandit linear optimization. More recently, [46] proposed an inefficient algorithm for online bandit linear optimization with the optimal regret bound O(poly(d)√T), based on a multi-armed bandit algorithm. The key disadvantage of [46] is that it is not computationally efficient. Abernethy, Hazan, and Rakhlin [2] presented an efficient randomized algorithm with an optimal regret bound O(poly(d)√T) that exploits the properties of self-concordant barrier regularization. For general online bandit convex optimization, [5] proposed optimal algorithms in a multi-point bandit setting, in which multiple points can be queried for their cost values. With multiple queries, they showed that the EOGD algorithm can attain an O(√T) expected regret bound. The key idea of multi-point bandit online convex optimization, proposed in [5], is to approximate the gradient using two function evaluations. More specifically, at each iteration t we randomly choose a unit direction u_t and measure the function values at the points w_t + δu_t and w_t − δu_t, i.e., f_t(w_t + δu_t) and f_t(w_t − δu_t), where δ > 0 is a small perturbation of order O(1/T). Given the two function evaluations, we approximate the gradient ∇f_t(w_t) by g_t = (d/(2δ))(f_t(w_t + δu_t) − f_t(w_t − δu_t)) u_t. The nice property of this sampling strategy is that the norm of the sampled gradient no longer depends on δ, i.e., ∥g_t∥ ≤ dG, and it yields the same regret bound as OGD.

2.2.5 From Regret to Risk Bounds

So far we have dealt with two different models for learning: statistical and sequential (online). The online setting stands in stark contrast to the statistical setting in a few respects. First, in statistical learning, or the batch model, there is a strict division between the training phase and the testing phase. In contrast, in the online model, training and testing occur together at the same time, since every example acts both as a test of what was learned in the past and as a training example for improving our predictions in the future. This requires the online learner to be adaptive to the environment. A second key difference relates to the generation of examples.
Statistical learning scenario follows the key assumption that the distribution over data points is fixed over time, both for training and test points, and samples are assumed to be drawn i.i.d. from an underlying distribution D. Furthermore, the goal is to learn a hypothesis with a small expected loss or generalization error. In contrast, in online setting there is no notion of generalization and algorithms are measured using a mistake bound model and the notion of regret, which are based on worst-case or adversarial assumption where the adversary deliberately trying to ruin the learners performance. Finally, we distinguish between the processing model of statistical and online learning settings. Online algorithms process one sample at a time and can thus be significantly more efficient both in time and space and more practical than batch algorithms, when processing modern data sets of several million or billion points. hence, these algorithms are more suitable for large scale learning. This stands in contrast to statistical learning algorithms such as ERM and it would be tempting to switch to online learning algorithm. Given the close relationship between these two settings and clear advantage of online 57 learning from computational viewpoint, a paramount question is ”whether or not algorithms developed in the sequential setting can be used for statistical learning with guaranteed generalization bound? ”. More precisely, can we devise algorithms that exhibits the desirable characteristics of online learning but also has good generalization properties. Since a regret bound holds for all sequences of training samples, it also holds for an i.i.d. sequence. What remains to be done is to extract a single hypothesis out of the sequence produced by the sequential method, and to convert the regret guarantee into a guarantee about generalization. Such a process has been dubbed an Online-to-Batch Conversion and foreshadows a key achievement, which is any online learning algorithm with sub linear regret can be converted into a batch algorithm. Here we introduce two methods to convert an online algorithm that attains low regret into a batch learning algorithm that attains low risk. Such online-to-batch conversions are interesting both from the practical and the theoretical perspectives [36, 48, 83]. Formally, let f1 , f2 , · · · , fT be an i.i.d sequence of loss functions, ft : W → R. In statistical setting one can think of each loss function ft (w) as ft (w) = ℓ(w, (xt , yt )) for a fixed loss function ℓ : W ×Ξ → R+ and a random instance (xt , yt ) ∈ Ξ sampled following the underlying distribution D. We feed these loss functions to an online learning algorithm A and assume that the online learner produces a sequence w1 , w2 , · · · , wT ∈ W of hypothesis. ˆ ∈ W with small generalization error. Here The goal is to construct a single hypothesis w we consider two solutions for this problem: randomized conversion and averaging. The simplest conversion scheme is to simply choose a random hypothesis uniformly at random from the sequence of hypothesis w1 , w2 , · · · , wT . At the first glance this idea seems naive, but it has few desirable properties. First, the average loss of the algorithm is an ˆ i.e., E[ℓ(w)] ˆ = (1/T ) unbiased estimate of the expected risk of w, 58 ∑T t=1 ft (wt ). Second, the conversion is applicable regardless of any convexity assumption. 
Finally, in expectation, the ˆ is upper bounded by the average per-round regret of online learner, i.e., excess loss of w [ ] E Regret(A, W, F, T ) ˆ − min LD (w) ≤ LD (w) . T w∈W An alternative solution which is only applicable to learning from convex loss functions ˆ = (1/T ) over convex hypothesis spaces, is to output the average solution w ∑T t=1 wt . This conversion is also enjoys the same properties as the randomized conversion with an additional important feature. That is, we can able to show high-probability bounds on the excess risk provided that loss functions are bounded. 2.3 Convex Optimization A generic convex optimization problem may be written as min f (w) subject to w ∈ W, where f : Rd → R chosen from a specific family of functions F is a proper convex function, and W ⊆ Rd is nonempty, compact, and convex set which is also called the constraint or feasible set. We denote by w∗ the optimal solution to above problem and assume that it exists, i.e., w∗ = arg minw∈W f (w). Ideally, the goal of an optimization algorithm is to compute the optimal solution, but almost always it is impossible to compute an exact w∗ in finite time. hence, we turn to find an ϵ-approximate solution. A solution w ∈ W is an ϵ sun-optimal if f (w) − minw′ ∈W f (w′ ) ≤ ϵ. For a given family F of convex functions over the feasible set W, our primary focus is to 59 determine the efficiency of an optimization procedure to produce sub-optimal solutions. To analyze the efficiency of convex optimization algorithm one typically follows the oracle model of optimization which lies in the heart of the complexity theory of convex optimization [119, 121]. 2.3.1 Oracle Complexity of Optimization A typical convex optimization procedure initially picks some point in the feasible convex set W and iteratively updates these points based on some local information about the function it calculates around these successive points. The method can decide which points to query at based on the results of earlier queries, and tries to use as few queries as possible to achieve its task. The crucial question we are interested to answer about a specific optimization problem is the number of queries the algorithm makes to find an ϵ-accurate solution. The oracle complexity is a general model to analyze the computational complexity of optimization algorithms. In the oracle model, there is an oracle O and an information set I. The oracle O is simply a function ψ : W → I that for any query point w ∈ W returns an output from I. The information set provided to the algorithms varies depending on the type of the oracle. In particular, a zero-order oracle returns f (w) for a given query w ∈ W, first-order oracle returns gradient I = {∇f (w)} (respectively a sub-gradient I = {g ∈ ∂f (w)} if the function is not differentiable), and a second-order oracle return the Hessian at the queried point. We also distinguish between noisy (or stochastic) and exact (or deterministic) oracle models. In the noisy oracle model, the information returned by the oracle are corrupted with zero-mean noise with bounded variance. The algorithm iteratively updates the solution based on the information accumulated in 60 Oracle Lipschitz Lipschitz & Strongly Deterministic Stochastic ρ ϵ2 ρ ϵ2 ρ2 αϵ ρ2 α2 ϵ Smooth β √ ϵ β ρ ϵ + ϵ2 Smooth & Strongly √ κ log αϵ ( ) √ 1 κ log βϵ + αϵ Table 2.1: Lower bound on the oracle complexity for stochastic/deterministic first-order optimization methods. 
Here ρ, α, and β are the Lipschitzness, strong convexity, and smoothness parameters, respectively. The parameter κ is the condition number of function and is defined as κ = β/α. previous iterations. In particular, in optimization with zero and first-order exact oracle model which is the main focus of large scale optimization methods, an optimization method updates the solution using wt = ϕt (w0 , . . . , wt−1 , ∇f (w0 ), . . . , ∇f (wt−1 ), f (w0 ), . . . , f (wt−1 )) where ϕt : × ∪ts=1 Is → W is updating mechanism utilized by the optimization algorithm at iteration t to determine the next query point wt . Roughly speaking, we measure complexity of an algorithm by the number of queries that it makes to a prescribed oracle for computing the final solution. Given a positive integer T corresponding to the number of iterations, the minimax oracle optimization error after T steps, over a set of functions F, is defined as follows: ( ) OracleComplexity(F, W, O, T ) = inf sup f (wT ) − inf f (w) . ψ f ∈F w∈W In other words, the minimax oracle complexity is the best possible rate of convergence (as a function of the number of queries) for the optimization error when one restricts to black-box procedures in order to guarantee delivering an ϵ-accurate solution to any function f ∈ F. A large body of literature is devoted to obtaining rates of convergence of specific pro- 61 cedures for various set of convex functions F of interest (essentially smooth/non-smooth, and strongly convex/non-strongly convex) and different types of oracles (essentially noisy or stochastic/deterministic or exact, zero order or derivative free, first order, and second order). The oracle complexity of first-order deterministic and stochastic oracle models are summarized in Table 2.1 for different family of loss functions elicited from [121] for deterministic and from [4, 119] for stochastic optimization. The algorithms which attain these lower bounds will be discussed later. 2.3.2 Deterministic Convex Optimization Here we briefly review the optimization algorithms in the first-order oracle model which are called gradient based methods for simplicity. More precisely, we assume that the only information the optimization methods can learn about the particular problem instance is the values and derivatives of these components (f (w), ∇f (w)) at query points w ∈ W. Recently, first-order methods have experienced a renaissance in the design of fast algorithms for largescale optimization problems. This is due the fact that although higher order methods such as interior point methods [118] have linear convergence rate, but this fast rate comes at the cost of more expensive iterations, typically requiring the solution of a system of linear equations in the input variables. Consequently, the cost of each iteration typically grows at least quadratically with the problem dimension, making interior point methods impractical for very-large-scale convex programs. The convergence rate of gradient based methods usually depends on the properties of the objective function to be optimized. When the objective function is strongly convex and smooth, it is well known that gradient descent methods can achieve a geometric convergence rate [33]. When the objective function is smooth but not strongly convex, the optimal 62 convergence rate of a gradient descent method is O(1/T 2 ), and is achieved by the Nesterov’s methods [93]. For the objective function which is strongly convex but not smooth, the convergence rate becomes O(1/T ) [134]. 
For general non-smooth objective functions, the optimal rate of any first-order method is O(1/√T). Although this rate is not improvable in general, recent studies improve it to O(1/T) by exploiting special structure of the objective function [123, 122]. In addition, several methods have been developed for composite optimization, where the objective function is written as a sum of a smooth and a non-smooth function [94, 93, 96]. The proofs of the results that follow can be found in [121] and in the referenced papers.

2.3.2.1 Gradient Descent Method

Perhaps the simplest and most intuitive algorithm for deterministic optimization is the gradient descent (GD) method, which was proposed by Cauchy in 1846 [35]. (Cauchy's original algorithm uses the steepest-descent direction with an exact line search, which converges slowly; a great deal of later work has addressed more effective step-size choices [64, 16].) To find a solution within the domain W that optimizes the given objective function f(w), GD computes the gradient of f(w) by querying a first-order deterministic oracle and updates the solution by moving it in the direction opposite to the gradient. To ensure that the solution stays within the domain W, GD projects the updated solution back onto W at every iteration.

Projected Gradient Descent (GD)
Input: convex set W, step size η > 0, function f ∈ F, first-order oracle O
Initialize: w_1 ∈ W
for t = 1, 2, . . . , T
  Query the oracle O at point w_t to get ∇f(w_t)
  Update w_{t+1} = Π_W(w_t − η∇f(w_t))
end for

Theorem 2.16 (Convergence Rate of GD). Let f ∈ F be a convex function defined over the convex domain W ⊆ R^d and let w_∗ = arg min_{w∈W} f(w) be the optimal solution. Then the GD algorithm satisfies:

• if f is ρ-Lipschitz, by setting η = R/(ρ√T) we have
  f( (1/T) ∑_{t=1}^{T} w_t ) − f(w_∗) ≤ ρ∥w_∗ − w_1∥ / √T ;

• if f is β-smooth, by setting η = 1/β we have
  f(w_T) − f(w_∗) ≤ 2β∥w_∗ − w_1∥² / T ;

• if f is β-smooth and α-strongly convex, with condition number κ = β/α, by setting η = 2/(α+β) we have
  f(w_T) − f(w_∗) ≤ (∥w_1 − w_∗∥²/2) ((κ−1)/(κ+1))^T .

By comparing the rates in Theorem 2.16 to the lower bounds in Table 2.1, one sees that GD attains the optimal bound only for Lipschitz functions. We also note that these bounds are independent of the dimension of the convex domain W, as long as the Euclidean norms of the solutions and gradients do not depend on the ambient dimension; this makes GD attractive for optimization in high dimensions. The computational bottleneck of projected GD is often the projection step, which is a convex optimization problem in its own right and may be expensive for many domains. In Chapter 9 we propose efficient optimization methods that do not require intermediate projection steps.

2.3.2.2 Accelerated Gradient Descent Method

The O(1/T) convergence rate of the GD method for smooth loss functions is far from the O(1/T²) rate permitted by the oracle complexity lower bound discussed before. Nesterov showed in 1983 that the convergence rate of GD can be improved without using anything more than gradient information at various points of the domain. Accelerated GD [120, 121, 123] closes the gap between the O(1/T) rate of GD and the oracle complexity lower bound with a simple twist of the GD method, and attains the optimal O(1/T²) convergence rate for minimizing smooth functions.
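Before turning to the accelerated scheme, the following minimal sketch illustrates the projected GD update of Section 2.3.2.1. The quadratic objective, the ℓ2-ball domain, and all names here are illustrative choices rather than part of the thesis; the step size η = 1/β matches the smooth case of Theorem 2.16.

```python
import numpy as np

def project_l2_ball(w, radius):
    """Euclidean projection onto the ball {w : ||w||_2 <= radius}."""
    norm = np.linalg.norm(w)
    return w if norm <= radius else w * (radius / norm)

def projected_gd(grad_f, w1, eta, T, radius):
    """Projected gradient descent: w_{t+1} = Pi_W(w_t - eta * grad f(w_t))."""
    w = w1.copy()
    iterates = [w.copy()]
    for _ in range(T):
        w = project_l2_ball(w - eta * grad_f(w), radius)
        iterates.append(w.copy())
    # last iterate (smooth case) and averaged iterate (Lipschitz case)
    return w, np.mean(iterates, axis=0)

# Illustrative smooth objective f(w) = 0.5 * ||A w - b||^2, whose smoothness
# parameter beta is the largest eigenvalue of A^T A.
rng = np.random.default_rng(0)
A, b = rng.standard_normal((50, 10)), rng.standard_normal(50)
beta = np.linalg.eigvalsh(A.T @ A).max()
grad_f = lambda w: A.T @ (A @ w - b)
w_T, w_avg = projected_gd(grad_f, np.zeros(10), eta=1.0 / beta, T=200, radius=5.0)
```

The sketch returns both the last iterate and the average of the iterates, mirroring the fact that the smooth and Lipschitz guarantees of Theorem 2.16 are stated for different output points.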
Accelerated Gradient Descent (AGD)
Input: function f ∈ F, first-order oracle O
Initialize: w_1 = z_1 = 0, λ_0 = 0
for t = 1, 2, . . . , T
  Query the oracle O at point w_t to get ∇f(w_t)
  Set λ_t = (1 + √(1 + 4λ_{t−1}²))/2 and γ_t = (1 − λ_t)/λ_{t+1}
  Update z_{t+1} = w_t − (1/β)∇f(w_t)
  Update w_{t+1} = (1 − γ_t)z_{t+1} + γ_t z_t
end for

The following theorem shows that AGD achieves an O(1/T²) convergence rate, which is tight.

Theorem 2.17 (Convergence Rate of AGD). Let f ∈ F be a convex and β-smooth function and let w_∗ be the optimal solution. Then accelerated gradient descent outputs a solution which satisfies:
  f(z_T) − f(w_∗) ≤ 2β∥w_1 − w_∗∥² / T².

2.3.2.3 Mirror Descent Method

Mirror Descent (MD) is a first-order optimization procedure which generalizes the classic GD method to non-Euclidean geometries by relying on a distance generating function specific to the geometry. The original MD algorithm was developed to perform gradient descent in spaces where the gradient only makes sense in the dual space. In these cases, MD first maps the point w_t into the dual space by the mapping Φ, then performs the gradient update in the dual space, and finally maps the resulting point back to the primal space. When the mapping is Φ(w) = (1/2)∥w∥², the primal and dual spaces coincide and MD reduces to plain gradient descent.

Mirror Descent (MD)
Input: η > 0, function f ∈ F, first-order oracle O
Initialize: w_1 = 0
for t = 1, 2, . . . , T
  Query the oracle O at point w_t to get ∇f(w_t)
  Update ∇Φ(z_{t+1}) = ∇Φ(w_t) − η∇f(w_t)
  Update w_{t+1} = argmin_{w∈W∩K} B_Φ(w, z_{t+1})
end for

Theorem 2.18 (Convergence Rate of MD). Let Φ be a mirror map and assume that Φ is α-strongly convex on W ∩ K with respect to ∥·∥. Let R² = sup_{w∈K∩W} Φ(w) − Φ(w_1) and let f be convex and ρ-Lipschitz w.r.t. ∥·∥. Then the MD algorithm with η = (R/ρ)√(2α/T) satisfies
  f( (1/T) ∑_{t=1}^{T} w_t ) − min_{w∈W} f(w) ≤ ρR √(2/(αT)).

The MD algorithm can alternatively be expressed as a nonlinear projected sub-gradient type method, derived from a general distance generating function (the Bregman divergence of Definition ??) instead of the usual Euclidean squared distance [17]:
  w_{t+1} = argmin_{w∈W} { ⟨w, ∇f(w_t)⟩ + (1/η) B_Φ(w, w_t) }.   (2.11)

Remark 2.19. In terms of convergence rate, MD obtains the same rate as the GD method, but it has the advantage of exploiting the geometry of the convex domain. More specifically, since MD adapts to the structure of the domain W via the mapping Φ, it has a milder dependence on the dimensionality of the domain, which is appealing for large scale optimization problems. As an example, it is easy to verify that for optimization over the simplex ∆ = {w ∈ R^d_{++} : ∑_i w_i = 1}, using the negative entropy Φ(w) = ∑_{i=1}^d w_i log w_i as the mapping function, the dependence of MD on d is of order log d, while the regular GD algorithm has a linear O(d) dependence.

2.3.2.4 Mirror Prox Method

In the black-box oracle model the algorithm has access to the values and gradients of the function without knowing the structure of the objective. But in many circumstances we never face a pure black-box model and do have some information about the structure of the underlying function. Interestingly, the proper use of the structure of the problem can help to obtain better convergence rates for specific families of loss functions [122, 123, 117].
In particular, in [117] it been shown that for non-smooth Lipschitz continuous functions which admit a smooth saddle-point representation one can obtain a rate of convergence of order O(1/T ) with a properly designed gradient descent method, despite the fact that the original function is √ non-smooth and can not be optimized with a convergence rate better then O(1/ T ) in black-box model. As an example consider the function f to be optimized is of the form f (w) = max1≤i≤n fi (w) where each individual functions fi (w), i ∈ [n] is convex, β-smooth and ρ-Lipschitz in some norm ∥ · ∥. In this case the function f (w) is not smooth and the √ best convergence rate one can hope in the black-box model is O(1/ T ). Let Φ : K → R be a mirror map on W and let w1 ∈ argminw∈W∩K Φ(w). The mirror 68 prox (extragradient in a specialized case) method is detailed below. Extra Gradient Descent Method (EGD) Input: η > 0, function f ∈ F , first-order oracle O Initialize: w1 = z1 = 0 for t = 1, 2, . . . , T Query the oracle O at point wt to get ∇f (wt ) Update ∇Φ(z′t+1 ) = ∇Φ(wt ) − η∇f (wt ) Update zt+1 ∈ argminz∈W∩K BΦ (z, z′t+1 ) and query the oracle to get ∇f (zt+1 ) ′ ) = ∇Φ(w ) − η∇f (z Update ∇Φ(wt+1 t t+1 ) ′ ) Update wt+1 ∈ argminw∈W∩K BΦ (w, wt+1 end for The EGD method first makes a step of MD to go from wt to zt+1 , and then it makes a similar step to obtain wt+1 , starting again from wt but this time using the gradient of f evaluated at zt+1 . The following theorem exhibits the rate of convergence for EGD algorithm. Theorem 2.20 (Convergence Rate of EGD). Let Φ be a α-strongly convex on K ∩ W with respect to ∥ · ∥. Let R = supw∈K∩W Φ(w) − Φ(w1 ) and f be convex and β-smooth w.r.t. ∥ · ∥. Then EGD with η = α β has a convergence rate as: ( ) T 1∑ βR2 . f zt − min f (w) ≤ T αT w∈W t=1 69 2.3.2.5 Conditional Gradient Descent Method The main computation bottleneck of gradient descent methods in solving constrained optimization problems is the projection step which might be as hard as solving the original optimization problem. Surprisingly the projection step can be avoided by replacing the expensive projection operation with other kinds of light computational operations. One such an example is the Conditional Gradient Descent (CGD) method which is also known as Frank-Wolf algorithm. The Frank-Wolfe method, that was originally introduced in a paper by Frank and Wolfe from the 1950 [56], where they aimed to present an algorithm for minimizing a quadratic function over a polytope using only linear optimization steps over the feasible set. The CGD algorithm proceeds by iteratively solving a linear optimization problem to find a direction pt inside the domain W and updating the solution as a linear combination of the obtained direction and previous solution. This procedure guarantees that the updated solutions remain inside the feasible domain W. This method replaces the projection step with a linear optimization problem over the constrained domain which is more efficient as long as the linear problem is easy to be solved. 70 Conditional Gradient Descent (CGD) Input: convex set W, η > 0, a smooth convex function f ∈ F Initialize: w1 ∈ W for t = 1, 2, . . . , T Find pt = arg minp∈W ⟨∇f (wt ), p⟩ Update wt+1 = (1 − ηt )wt + ηt pt end for The following result shows the convergence rate of CGD for smooth functions. Theorem 2.21 (Convergence Rate of CGD). Assume that f ∈ F be a β-smooth convex function with respect to some norm ∥ · ∥ defined over the convex domain W. 
Let R = sup_{w,w′∈W} ∥w − w′∥. Then, by setting η_t = 2/(t+1) in the CGD method, we have:
  f(w_T) − f(w_∗) ≤ 2βR² / (T+1).

In Chapter 9 we will show that by replacing the projection step with a gradient computation of the constraint function, it is possible to devise efficient stochastic optimization methods which only require a single projection at the final iteration.

2.3.3 Stochastic Convex Optimization

So far we have assumed that the optimization algorithm has access to a noiseless oracle. It is more realistic to consider noisy oracles, where one does not have access to exact objective function or gradient values, but rather to their noisy estimates (usually with zero mean and bounded variance). In particular, for a fixed closed convex subset W ⊂ R^d we consider the following optimization problem:

  min_{w∈W} f(w)  for  f(w) = E[F(w, ξ)] = ∫_Ξ F(w, ξ) dP(ξ),   (2.12)

where we assume that the expected value function f(w) is continuous and convex on W. We note that if the function F(w, ξ) is convex on W, then f(w) is also convex and the problem becomes a convex programming problem. The main difficulty in solving the stochastic optimization problem in (2.12) is that the multidimensional integral (expectation) cannot be computed with high accuracy [115], and in statistical learning problems we usually do not know the distribution P. There are two families of methods that address this issue: stochastic approximation (SA) and sample average approximation (SAA).

The main idea of the SAA approach to solving stochastic programs is as follows. A sample ξ_1, ξ_2, · · · , ξ_n of n realizations of the random vector in the objective is generated, and the stochastic objective is approximated by the sample average function. Then, a deterministic optimization algorithm is applied to the approximate function. We note that we cannot perform gradient descent on f(w) directly, as we would need to know the underlying distribution to compute a gradient of f(w).

In SA we assume that there is a stochastic oracle O which, for a given point (w, ξ) ∈ W × Ξ, returns an unbiased estimate of a subgradient of f(w); in other words, it returns g such that E[g] ∈ ∂f(w). Stochastic optimization methods thus take steps that are only in expectation along the negative gradient. Based on this oracle, a simple algorithm to optimize the objective is Stochastic Gradient Descent (SGD). SGD is in the same spirit as GD, but it replaces the true gradients with stochastic gradients when updating the solutions:

Stochastic Gradient Descent (SGD)
Input: convex set W, step size η > 0, function f ∈ F, stochastic first-order oracle O
Initialize: w_1 = 0
for t = 1, 2, . . . , T
  Query the stochastic oracle O at point w_t to get g_t where E[g_t] ∈ ∂f(w_t)
  Update w_{t+1} = Π_W(w_t − ηg_t)
end for
Return: ŵ = (1/T) ∑_{t=1}^{T} w_t

Under mild conditions on the stochastic gradients, namely
  E_{ξ_t}[g_t] = ∇f(w_t)  and  E_{ξ_t}[exp(∥g_t − ∇f(w_t)∥²_∗ / σ²)] ≤ exp(1),
one can show that the SGD algorithm converges to the optimal solution at rate O(1/√T) with high probability. It is also straightforward to generalize the mirror descent method to the stochastic setting by replacing the Euclidean distance in the SGD update with a Bregman divergence adapted to the specific domain.

Comparing the SGD method for stochastic optimization with the OGD method for regret minimization, we note that the two methods are closely related. Although they look similar algorithmically, there are important conceptual differences between SGD and OGD.
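The following minimal sketch instantiates the SGD procedure just described. The stochastic oracle here is simulated by adding zero-mean Gaussian noise to the exact gradient of a synthetic quadratic; all names, constants, and the ℓ2-ball domain are illustrative assumptions, not part of the thesis.

```python
import numpy as np

def project_l2_ball(w, radius):
    """Euclidean projection onto the ball {w : ||w||_2 <= radius}."""
    norm = np.linalg.norm(w)
    return w if norm <= radius else w * (radius / norm)

def sgd(stochastic_grad, w1, eta, T, radius):
    """Projected SGD with iterate averaging: returns w_hat = (1/T) * sum_t w_t."""
    w = w1.copy()
    avg = np.zeros_like(w)
    for _ in range(T):
        g = stochastic_grad(w)                     # E[g] is a (sub)gradient of f at w
        w = project_l2_ball(w - eta * g, radius)   # projected stochastic step
        avg += w
    return avg / T

# Illustrative stochastic oracle: exact gradient of f(w) = 0.5*||A w - b||^2
# corrupted by zero-mean Gaussian noise with standard deviation sigma.
rng = np.random.default_rng(1)
A, b = rng.standard_normal((50, 10)), rng.standard_normal(50)
sigma = 0.5
stochastic_grad = lambda w: A.T @ (A @ w - b) + sigma * rng.standard_normal(10)
w_hat = sgd(stochastic_grad, np.zeros(10), eta=0.01, T=5000, radius=5.0)
```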
We note that in stochastic optimization the goal is to generate a sequence of solutions which 73 quickly convergences to the minimum of a function defined as f (w) = E[F (w, ξ)], while is online learning the goal is to generate a sequence of solutions that accumulates a small loss during the learning measured in terms of regret. In other words SGD provides an incremental solution to a stochastic optimization problem and OGD provides a solution to adopt a sequence of adversarially generated loss functions. We note that regret minimization algorithms equipped with online to batch conversion schemas discussed before settle an efficient paradigm to solve general optimization problems, but sometimes it seems essential to go beyond this barrier to obtain optimal convergence rates in stochastic setting [72, 125]. Remark 2.22. It is remarkable that in stark contrast to deterministic optimization where the smoothness of objective function makes a significant improvement in terms of convergence rate (i.e., Theorem 2.16), in stochastic optimization the smoothness is not a desirable property as it yields the same convergence rate as the Lipschitz functions. In particular, as it has been shown in Appendix A.3, a tight analysis of stochastic mirror descent algorithm has 2 σR √ an O( βR T + T ) convergence rate for smooth objective functions, which is dominated by the √ slow O(1/ T ) rate unless the variance of stochastic gradients becomes zero σ = 0. As it will be discussed in Chapter 7, the mixed optimization paradigm we introduce in thesis is able to leverage the smoothness of objection function to attain an O(1/T ) rate by accessing the full gradient oracle log T times on top of the O(T ) accesses of the stochastic gradient oracle. 2.3.4 Convex Optimization for Learning Problems Formulating statistical learning tasks and in particular convex learning problems as a convex optimization problem makes an intimate connection between learning and mathematical optimization. Therefore, optimization methods play a central role in solving machine learning 74 problems and challenges exist in machine learning applications demand the development of new optimization algorithms. To see this, consider the typical problem of the supervised learning consisting of a input space Ξ = X × Y and a suitable set of hypotheses W for prediction such as the set of linear predictors, i.e., W = {x → ⟨w, x⟩ : w ∈ Rd }. Then, the learner is provided with a training sample S = ((x1 , y1 ), (x2 , y2 ), · · · , (xn , yn )) ∈ Ξn and is supposed to pick a hypothesis w ∈ W which minimizes appropriate empirical cost over the training sample based on a predefined surrogate loss function ℓ : W × Ξ → R+ . The last step of this learning process corresponds to an optimization algorithm that solves the minimization problem of picking that hypothesis from the set of hypotheses. As a result, convex optimization forms the backbone of many algorithms for statistical learning. This formulation includes support vector machine (SVM), support vector regression (SVR), Lasso, logistic regression, and ridge regression among many others as detailed below: • Hinge loss (Support Vector Machine)): n ∑ max(0, 1 − yi ⟨w, xi ⟩). i=1 • Logistic loss (Logistic Regression): min w∈W • Least-squares loss (Regression): min w∈W log(1 + exp(−yi ⟨w, xi ⟩)). i=1 n ∑ w∈W • Exponential loss (Boosting): min n ∑ (yi − ⟨w, xi ⟩)2 . i=1 n ∑ exp(−yi ⟨w, xi ⟩). i=1 The domain W in above formulations, captures the constrains on the classifier w. 
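For concreteness, the following sketch spells out these four surrogate losses and their (sub)gradients for a linear predictor ⟨w, x⟩. The function names are ours; labels are assumed to lie in {−1, +1} (real-valued for the squared loss).

```python
import numpy as np

# Per-example surrogate losses for a linear predictor <w, x>.
# Each function returns (loss value, (sub)gradient with respect to w).

def hinge(w, x, y):
    margin = y * np.dot(w, x)
    loss = max(0.0, 1.0 - margin)
    grad = -y * x if margin < 1.0 else np.zeros_like(w)   # a valid subgradient
    return loss, grad

def logistic(w, x, y):
    margin = y * np.dot(w, x)
    loss = np.log1p(np.exp(-margin))
    grad = -y * x / (1.0 + np.exp(margin))
    return loss, grad

def squared(w, x, y):
    residual = np.dot(w, x) - y
    return residual ** 2, 2.0 * residual * x

def exponential(w, x, y):
    margin = y * np.dot(w, x)
    loss = np.exp(-margin)
    return loss, -y * x * loss
```

The hinge loss is convex but non-smooth, while the logistic, squared, and exponential losses are smooth; this distinction is exactly what drives the different convergence rates discussed in the previous subsections.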
Commonly considered examples are the bounded Euclidean ball W = {w ∈ Rd : ∥w∥2 ≤ R}, bounded ℓ1 ball W = {w ∈ Rd : ∥w∥1 ≤ B} or the box W = {w ∈ Rd : ∥w∥∞ ≤ B}. We note that instead of moving the constraint into the W, by leveraging on the theory of 75 Lagrangian method in constrained optimization, one can simply move the constraint into the objective and solve the unconstrained optimization problem, i.e., W = Rd . To fully understand the application of convex optimization methods to solving machine learning problems, let us consider the following optimization problem: 1∑ ℓ(w, (xi , yi )) n n min LS (w) = w∈W (2.13) i=1 A preliminary approach for solving the optimization problem in (2.13) is the batch gradient descent (GD) algorithm. It starts with some initial point, and iteratively updates the solution using the equation wt+1 = ΠW (wt − η∇LS (wt )) where 1∑ ∂ℓ(w, (xi , yi ))xi . n n ∇LS (w) = i=1 The main shortcoming of GD method is its high cost in computing the full gradient ∇LS (wt ), i.e., O(n) gradient computations, when the number of training examples is large. Stochastic gradient descent (SGD) alleviates this limitation of GD by sampling one (or a small set of) examples and computing a stochastic (sub)gradient at each iteration based on the sampled examples. Since the computational cost of SGD per iteration is independent of the size of the data (i.e., n), it is usually appealing for large-scale learning and optimization [29, 115, 134]. Despite of their slow rate of convergence compared with the batch methods, stochastic optimization methods have shown to be very effective for large scale and online learning problems, both theoretically [115, 94] and empirically [134]. We note although a large number of iterations is usually needed to obtain a solution of desirable accuracy, the lightweight 76 computation per iteration makes SGD attractive for many large-scale learning problems. 2.3.5 From Stochastic Optimization to Convex Learning Theory As mentioned earlier, most of existing learning algorithms follow the framework of empirical risk minimizer or regularized ERM, which was developed to great extent by Vapnik and Chervonenkis [146]. Essentially, ERM methods use the empirical loss over S, i.e., 1∑ ℓ(w, (xi , yi )), n n LS (w) = i=1 as a criterion to pick a hypothesis. From optimization viewpoint, the ERM methods resembles the widely used Sample Average Approximation (SAA) method in the optimization community when the hypothesis space and the loss function are convex. If uniform convergence holds, then the empirical risk minimizer is consistent, i.e., the population risk of the ERM converges to the optimal population risk, and the problem is learnable using ERM. A rather different paradigm for risk minimization is stochastic optimization. Recall that the goal of learning is to approximately minimize the risk LD (w) = E(x,y)∼D [ℓ(w, (x, y))]. (2.14) However, since the distribution D is unknown to the learner, we can not utilize standard gradient methods to directly minimize the expected loss in (2.14). This is because we are not able to compute the gradient ∇LD (w) at a particular query point w. 
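Returning to the empirical objective (2.13), the following sketch contrasts one full-gradient step, which touches all n examples, with one stochastic step based on a single sampled example. The squared loss, the synthetic data, and the step sizes are illustrative assumptions only.

```python
import numpy as np

# Empirical objective (2.13) with squared loss: L_S(w) = (1/n) * sum_i (<w, x_i> - y_i)^2.
rng = np.random.default_rng(2)
n, d = 1000, 20
X, y = rng.standard_normal((n, d)), rng.standard_normal(n)

def batch_gd_step(w, eta):
    """One batch GD step: forming the full gradient costs O(n) gradient evaluations."""
    grad = (2.0 / n) * X.T @ (X @ w - y)
    return w - eta * grad

def sgd_step(w, eta):
    """One SGD step: a single sampled example, so the per-iteration cost is independent of n."""
    i = rng.integers(n)
    grad = 2.0 * (X[i] @ w - y[i]) * X[i]
    return w - eta * grad

w_batch, w_stoch = np.zeros(d), np.zeros(d)
for _ in range(100):
    w_batch = batch_gd_step(w_batch, eta=1e-3)
    w_stoch = sgd_step(w_stoch, eta=1e-3)
```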
We note that this is different from the application of SGD for solving the optimization problem in (2.13) because in (2.13) the randomness is over the uniform sampling from the objective function 77 which is known (essentially we have a randomized optimization method), while in (2.14) the randomness is imposed on the instance space Ξ through a distribution D which is unknown to the learner in advance. In stochastic optimization all we need is not the exact gradient of objective function, but an unbiased estimate of the true gradient ∇LD (w). Surprisingly, it turns out that the construction of this unbiased estimate is extremely simple for risk minimization as follows. First, we sample an instance z = (xi , yi ) ∈ Ξ according to D and set the stochastic gradient to be g = ∂ℓ(w, (xi , yi ))xi , which will be an unbiased estimate of true gradient, i.e., E[g] = ∇LD (w). The beauty of SGD for direct risk minimization is that it is efficient and it delivers the same sample complexity as the ERM method. To motivate stochastic optimization as an alternative to the ERM method, [132, 131] challenged the ERM method and showed that there is a real gap between learnability and uniform convergence by investigating nontrivial problems where no uniform convergence holds, but they are still learnable using SGD algorithm [115]. These results uncovered an important relationship between learnability and stability, and showed that stability together with approximate empirical risk minimization, assures learnability [133]. Unlike ERM method in which the learnability is characterized by attendant complexity of hypothesis space, in SGD based learning, stability is a general notion to characterize learnability. In particular, in learning setting under i.i.d. samples where uniform convergence is not necessary for learnability, but where stability is both sufficient and necessary for learnability. 78 Chapter 3 Passive Learning with Target Risk The setup of this chapter will be in the classical statistical learning setting discussed in Chapter 2, but with a slight modification. In particular, we assume that the target expected loss, also referred to as target risk, is provided in advance for learner as prior knowledge. Unlike most studies in the learning theory that only incorporate the prior knowledge into the generalization bounds, we are able to explicitly utilize the target risk in the learning process. By leveraging on the smoothness of loss function, our analysis reveals a surprising result on the sample complexity of learning: by exploiting the target risk in the learning algorithm, we show that when the loss function is both smooth and strongly convex, the ( ) sample complexity reduces to O(log 1ϵ ), an exponential improvement compared to the sample complexity O( 1ϵ ) for learning with strongly convex loss functions. Furthermore, our proof is constructive and is based on a computationally efficient stochastic optimization algorithm, dubbed ClippedSGD, for such settings which demonstrate that the proposed algorithm is practically useful. The remainder of the chapter is organized as follows: Section 3.1 motivates the problem and setups the notation. Section 3.2 motivates the main intuition behind the proposed algorithm. The proposed ClippedSGD algorithm and main result on its sample complexity are discussed in Section 3.3. The proof of logarithmic sample complexity is given in Section 3.4 and the omitted proofs are deferred to Section 3.5. 
Section 3.6 summarizes the chapter and Section 3.7 surveys the related works. 79 3.1 Setup and Motivation Recall that in the standard statistical or passive supervised learning setting, we consider an input space Ξ ≡ X × Y where X ⊆ Rd is the space for instances and Y is the set of labels, and a hypothesis class H from which we choose a classifier. We assume that the domain space Ξ is endowed with an unknown probability measure D and measure the performance of a specific hypothesis h by defining a nonnegative loss function ℓ : H × Ξ → R+ . The risk of a hypothesis h with respect to the underlying distribution D is defined as: LD (h) = Ez∼D [ℓ(h, z)]. Given a sample S = (z1 , · · · , zn ) = ((x1 , y1 ), · · · , (xn , yn )) ∼ Ξn , the goal of a learning algorithm is to pick a hypothesis h : X → Y from H in such a way that its risk LD (h) is close to the minimum possible risk of a hypothesis in H. In the new setting we consider for learning here, we assume that before the start of the learning process, the learner has in mind a target expected loss, also referred to as target risk, denoted by ϵprior 1 , and tries to learn a classifier with the expected risk of O(ϵprior ) by labeling a small number of training examples. We further assume the target risk ϵprior is feasible, i.e., ϵprior ≥ ϵopt where ϵopt = minh∈H LD (h). To address this problem, we develop an efficient algorithm, based on stochastic optimization, for passive learning with target risk. The most surprising property of the proposed algorithm is that when the loss function is both smooth and strongly convex, it only needs O(d log(1/ϵprior )) labeled examples to find a classifier with the expected risk of O(ϵprior ), where d is the dimension of data. This is a 1 We use ϵ prior instead of ϵ to emphasize the fact that this parameter is known to the learner in advance. 80 significant improvement compared to the sample complexity for empirical risk minimization. We note that the target risk assumption is fully exploited by the learning algorithm and stands in contrast to all those assumptions such as the nature of unknown distribution D, sparsity, and margin that usually enter into the generalization bounds and are often perceived as a rather crude way to incorporate such assumptions. The key intuition behind the ClippedSGD algorithm is that by knowing target risk as prior knowledge, the learner has better control over the variance in stochastic gradients, which contributes mostly to the slow convergence in stochastic optimization and consequentially large sample complexity in passive learning. The trick is to run the stochastic optimization in multiple stages with a fixed size and decrease the variance of stochastically perturbed gradients at each iteration by a properly designed mechanism. Another crucial feature of the proposed algorithm is to utilize the target risk ϵprior to gradually refine the hypothesis space as the algorithm proceeds. Our algorithm differs significantly from standard stochastic optimization algorithms and is able to achieve a geometric convergence rate with the knowledge of target risk ϵprior . To analyze the sample complexity of ClippedSGD algorithm, we pursue the stochastic optimization viewpoint for risk minimization detailed in Chapter 2. Precisely, we focus on the convex learning problems for which we assume that the hypothesis class H is a parametrized convex set H = {hw : x → ⟨w, x⟩ : w ∈ Rd , ∥w∥ ≤ R} and for all z = (x, y) ∈ Ξ, the loss function ℓ(·, z) is a non-negative convex function. 
Thus, in the remainder we simply use vector w to represent hw , rather than working with hypothesis hw . We will assume throughout that X ⊆ Rd is the unit ball so that ∥x∥ ≤ 1. Finally, the conditions under which we can get the desired result on sample complexity depend on analytic properties of the loss function. In particular, we assume that the loss function is strongly convex and smooth as defined in 81 Chapter 2 and can be found in Appendix ??. We would like to emphasize that in our setting, we only need that the expected loss function LD (w) be strongly convex, without having to assume strong convexity for individual loss functions. 3.2 The Curse of Stochastic Oracle We begin by discussing stochastic optimization for risk minimization, convex learnability, and then the main intuition that motivates the proposed algorithm. As mentioned earlier in Chapter 2, most existing learning algorithms follow the framework of empirical risk minimizer (ERM) or regularized ERM methods that use the empirical loss over S, i.e., LS (w) = n1 ∑n i=1 ℓ(w, zi ), as a criterion to pick a hypothesis. In regu- larized ERM methods, the learner picks a hypothesis that jointly minimizes LS (w) and a regularization function over w. A rather different paradigm for risk minimization is stochastic optimization. Recall that the goal of learning is to approximately minimize the risk LD (w) = Ez∼D [ℓ(w, z)]. However, since the distribution D is unknown to the learner, we can not utilize standard gradient methods to minimize the expected loss. Stochastic optimization methods circumvent this problem by allowing the optimization method to take a step which is only in expectation [ ] along the negative of the gradient. To directly solve minw∈H LD (w) = Ez∼D [ℓ(w, z)] , a typical stochastic optimization algorithm initially picks some point in the feasible set H and iteratively updates these points based on first order perturbed gradient information about the function at those points. For instance, the widely used SGD algorithm starts with w0 = 0; at each iteration t, it queries the stochastic oracle Os at wt to obtain a perturbed 82 but unbiased gradient gt and updates the current solution by wt+1 = ΠH (wt − ηt gt ) , where ΠH (·) projects the solution w into the domain H. To capture the efficiency of optimization procedures in a general sense, one can use oracle complexity of the algorithm which, roughly speaking, is the minimum number of calls to any oracle needed by any method to achieve desired accuracy [121]. We note that the oracle complexity corresponds to the sample complexity of learning from the stochastic optimization viewpoint previously discussed. This viewpoint for learning theory has been taken by few very recent works [132, 131] where the ERM method has been challenged and it has been shown that there is a real gap between learnability and uniform convergence. This has been done by investigating non-trivial problems where no uniform convergence holds, but they are still learnable using SGD algorithm. These results uncovered an important relationship between learnability and stability, and showed that stability together with approximate empirical risk minimization, assures learnability [133]. Unlike ERM method in which the learnability is characterized by attendant complexity of hypothesis space, in SGD based learning, stability is a general notion to characterize learnability. In particular, in learning setting under i.i.d. 
samples where uniform convergence is not necessary for learnability, but where stability is both sufficient and necessary for learnability. To motivate the main intuition behind the proposed method, we begin by stating the following theorem which provides a lower bound on the sample complexity of stochastic optimization algorithms that is taken from [119]. Theorem 3.1 (Lower Bound on Oracle Complexity). Suppose LD (w) = Ez∼D [ℓ(w, z)] 83 is α-strongly and β-smooth convex function defined over convex domain H. Let Os be a stochastic oracle that for any point w ∈ H returns an unbiased estimate g, i.e., E[g] = [ ] ∇LD (w), such that E ∥g − ∇LD (w)∥2 ≤ σ 2 holds. Then for any stochastic optimization algorithm A to find a solution w with ϵ accuracy respect to the optimal solution w∗ , i.e., E [LD (w) − LD (w∗ )] ≤ ϵ, the number of calls to Os is lower bounded by (√ O(1) β log α ( β∥w0 − w∗ ∥2 ϵ ) σ2 + αϵ ) . (3.1) The first term in (3.1) comes from deterministic oracle complexity and the second term is due to noisy gradient information provided by stochastic oracle Os . As indicated in (3.1), the slow convergence rate for stochastic optimization is due to the variance in stochastic ( ) gradients, leading to at least O σ 2 /ϵ queries to be issued. We note that the idea of minibatch [44, 50], although it reduces the variance in stochastic gradients, does not reduce the oracle complexity. We close this section by informally presenting why logarithmic sample complexity is, in principle, possible, under the assumption that target risk is known to the learner A. To this end, consider the setting of Theorem 3.1 and assume that the learner A is given the prior accuracy ϵprior and is asked to find an ϵprior -accurate solution. If it happens that the variance [ ] of stochastic oracle Os has the same magnitude as ϵprior , i.e., E ∥g − ∇LD (w)∥2 ≤ ϵprior , then from (3.1) it follows that the second term vanishes and the learner A needs to issue ) ( only O log 1/ϵprior queries to find the solution. But, since there is no control on the stochastic oracle Os , except that the variance of stochastic gradients are bounded, A needs a mechanism to manage the variance of perturbed gradients at each iteration in order to 84 alleviate the influence of noisy gradients. One strategy is to replace the unbiased estimate of gradient with a biased one, which unfortunately may yield loose bounds. To overcome this problem, we introduce a strategy that shrinks the solution space with respect to the target risk ϵprior to control the damage caused by biased estimates. As an illustrative example to see how the knowledge of target risk is helpful, we consider a simple one dimensional regression problem with loss function ℓ(w, x) = (wx−b)2 where b is a random variable that can either be δ or 1 with Pr[b = δ] = 1 − δ 2 . Here we choose δ to be a very small value δ ≪ 1. The loss function is non-negative, smooth, and strongly convex and is appropriate for our setting. For this setting we have, ϵopt ≤ Eb [ℓ(0)] = δ 2 ×1+(1−δ 2 )×δ 2 ≤ 2δ 2 which can be arbitrarily small. For this example, the solution obtained by ERM with a small number of training examples will be on order of δ and therefore its expected risk will be on the order of δ 2 . However, from the viewpoint of the learner, this expected risk is unknown unless the learner could figure out Pr(b = 1) = δ 2 , which unfortunately requires an order of 1/δ 2 samples. 
On the other hand, by having the target feasible risk as prior knowledge the learner is able to find out Pr(b = 1) with a small number of samples. 3.3 The ClippedSGD Algorithm In this section we proceed to describe the proposed algorithm and state the main result on its sample complexity. 3.3.1 The Algorithm Description We now turn to describing our algorithm. Interestingly, our algorithm is quite dissimilar to the classic stochastic optimization methods. It proceeds by running the algorithm online on 85 fixed chunks of examples, and using the intermediate hypotheses and target risk ϵprior to gradually refine the hypothesis space. As mentioned above, we assume in our setting that the target expected risk ϵprior is provided to the learner a priori. We further assume the target risk ϵprior is feasible for the solution within the domain H, i.e., ϵprior ≥ ϵopt . The proposed algorithm explicitly takes advantage of the knowledge of expected risk ϵprior to ( ) attain an O log(1/ϵprior ) sample complexity. Throughout we shall consider linear predictors of form ⟨w, x⟩ and assume that the loss function of interest ℓ(⟨w, x⟩, y) is β-smooth. It is straightforward to see that LD (w) = E(x,y)∼D [ℓ(⟨w, x⟩, y)] is also β-smooth. In addition to the smoothness of the loss function, we also assume that LD (w) to be α-strongly convex. We denote by w∗ the optimal solution that minimizes LD (w), i.e., w∗ = arg minw∈H LD (w), and denote its optimal value by ϵopt . Let (xt , yt ), t = 1, . . . , T be a sequence of i.i.d. training examples. The proposed algorithm divides the T iterations into the m stages, where each stage consists of T1 training examples, i.e., T = mT1 . Let (xtk , ykt ) be the tth training example received at stage k, and let η be the step size used by all the stages. At the beginning of each stage k, we initialize the solution w by the average solution wk obtained from the last stage, i.e., T1 1 ∑ wk = wkt , T1 (3.2) t=1 where wkt denotes the tth solution at stage k. Another feature of the proposed algorithm is a domain shrinking strategy that adjusts the domain as the algorithm proceeds using intermediate hypotheses and target risk. We define the domain Hk used at stage k as Hk = {w ∈ H : ∥w − wk ∥ ≤ ∆k } , 86 (3.3) where ∆k is the domain size, whose value will be discussed later. Similar to the SGD method, at each iteration of stage k, we receive a training example (xtk , ykt ), and compute ( ) the gradient gkt = ℓ′ ⟨wkt , xtk ⟩, yt xtk . Instead of using the gradient directly, a clipped version ) ( of the gradient, denoted by vkt = clip γk , gkt , will be used for updating the solution. More specifically, the clipped vector vkt ∈ Rd is defined as ( [ ]) [ ] ) ([ ] ) ( [vkt ]i = clip γk , gkt , i = 1, . . . , d = sign gkt min γk , gkt i i i (3.4) where γk = 2ξβ∆k with ξ ≥ 1. Given the clipped gradient vkt , we follow the standard framework of stochastic gradient descent, and update the solution by wkt+1 = ΠH k ( ) wkt − ηvkt . (3.5) The purpose of introducing the clipped version of the gradient is to effectively control the variance in stochastic gradients, an important step toward achieving the geometric convergence rate. At the end of each stage, we will update the domain size by explicitly exploiting the target expected risk ϵprior as ∆k+1 = √ ε∆2k + τ ϵprior , (3.6) where ε ∈ (0, 1) and τ ∈ (0, 1) are two parameters, both of which will be discussed later. Algorithm 1 gives the detailed steps for the proposed method. 
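A minimal sketch of one stage of this update is given below; it combines the clipping operation (3.4), the projected step (3.5), the stage averaging (3.2), and the domain update (3.6). The per-example gradient is assumed to be supplied through a grad callable, and for simplicity the projection is only onto the ball of radius ∆_k around w̄_k, ignoring the intersection with the outer domain H; Algorithm 1 states the full procedure.

```python
import numpy as np

def clip(g, gamma):
    """Coordinate-wise clipping (3.4): [v]_i = sign([g]_i) * min(gamma, |[g]_i|)."""
    return np.sign(g) * np.minimum(gamma, np.abs(g))

def clipped_sgd_stage(w_bar, delta, sample, grad, eta, T1, xi, beta,
                      eps, tau, eps_prior):
    """One stage k of ClippedSGD.

    Returns the new average solution (3.2) and the shrunk domain size Delta_{k+1} (3.6).
    `sample` draws a training example (x, y); `grad` returns the per-example gradient.
    """
    gamma = 2.0 * xi * beta * delta              # clipping threshold gamma_k = 2*xi*beta*Delta_k
    w, avg = w_bar.copy(), np.zeros_like(w_bar)
    for _ in range(T1):
        x, y = sample()                          # receive a training example
        v = clip(grad(w, x, y), gamma)           # clipped stochastic gradient (3.4)
        w = w - eta * v                          # gradient step (3.5)
        offset = w - w_bar                       # project back onto the Delta_k-ball around w_bar_k
        norm = np.linalg.norm(offset)
        if norm > delta:
            w = w_bar + offset * (delta / norm)
        avg += w
    new_delta = np.sqrt(eps * delta ** 2 + tau * eps_prior)   # domain update (3.6)
    return avg / T1, new_delta                   # averaged stage solution (3.2)
```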
Algorithm 1 ClippedSGD Algorithm
1: Input: step size η, stage size T_1, number of stages m, target expected risk ϵ_prior, parameters ε ∈ (0, 1) and τ ∈ (0, 1) used for updating the domain size ∆_k, and parameter ξ ≥ 1 used to clip the gradients
2: Initialize: w̄_1 = 0, ∆_1 = R, and H_1 = H
3: for k = 1, . . . , m do
4:   Set w_k^1 = w̄_k and γ_k = 2ξβ∆_k
5:   for t = 1, . . . , T_1 do
6:     Receive training example (x_t, y_t)
7:     Compute the gradient g_k^t
8:     Clip the gradient g_k^t to v_k^t using [v_k^t]_i = sign([g_k^t]_i) min(γ_k, |[g_k^t]_i|), i = 1, . . . , d
9:     Update the solution by w_k^{t+1} = Π_{H_k}(w_k^t − ηv_k^t)
10:   end for
11:   Update the domain size ∆_{k+1} using (3.6)
12:   Compute the average solution w̄_{k+1} according to (3.2)
13:   Shrink the domain H_{k+1} using the expression in (3.3)
14: end for

The three important aspects of Algorithm 1, all crucial to achieving a geometric convergence rate, are highlighted as follows:

• Each stage of the proposed algorithm is comprised of the same number of training examples. This is in contrast to the epoch gradient algorithm [72], which divides the m iterations into exponentially increasing epochs and runs SGD with averaging on each epoch. Also, in our case the learning rate is fixed for all iterations.

• The proposed algorithm uses a clipped gradient for updating the solution in order to better control the variance in stochastic gradients; this stands in contrast to the SGD method, which uses the original gradients to update the solution.

• The proposed algorithm takes into account the target expected risk and the intermediate hypotheses when updating the domain size at each stage. The purpose of domain shrinking is to reduce the damage caused by the biased gradients that result from the clipping operation.

3.3.2 Main Result on Sample Complexity

The main theoretical result on the performance of the ClippedSGD algorithm is given in the following theorem.

Theorem 3.2 (Convergence Rate). Assume that the hypothesis space H is compact and the loss function ℓ is α-strongly convex and β-smooth. Let T = mT_1 be the size of the sample and let ϵ_prior be the target expected loss given to the learner in advance such that ϵ_opt ≤ ϵ_prior holds. Given ε ∈ (0, 1) and τ ∈ (0, 1), set ξ, T_1, and η as

  ξ = 4β/(ατ),  T_1 = 4 max{ ((ξ³βd + 2ξβ√d)/(εα)) ln(ms/δ), 16ξ²β²/(α²ε²) },  η = 1/(2ξβ√T_1),

where
  s = ⌈ log₂(ξβR²/ϵ_prior) ⌉.   (3.7)

After running Algorithm 1 over m stages, we have, with probability 1 − δ,

  L_D(w̄_{m+1}) ≤ (βR²/2) ε^m + (1 + τ/(1−ε)) ϵ_prior,

implying that only O(d log[1/ϵ_prior]) training examples are needed in order to achieve a risk of O(ϵ_prior).

We note that, compared to the bound in Theorem 3.1, the level of error down to which Algorithm 1 converges linearly is not determined by the noise level in the stochastic gradients but by the target risk. In other words, the algorithm is able to tolerate the noise by knowing the target risk as prior knowledge, and achieves linear convergence to the level of the target risk even when the variance of the stochastic gradients is much larger than the target risk. In addition, although the result given in Theorem 3.2 assumes a bounded domain with ∥w∥ ≤ R, this assumption can be lifted by effectively exploiting the strong convexity of the loss function and further assuming that the loss function is Lipschitz continuous with constant G, i.e., |L_D(w_1) − L_D(w_2)| ≤ G∥w_1 − w_2∥, ∀ w_1, w_2 ∈ H.
More specifically, the fact that the LD (w) is α-strongly convex with first order optimality condition, from Lemma A.14 for the optimal solution w∗ = arg minw∈H LD (w), we have LD (w) − LD (w∗ ) ≥ α ∥w − w∗ ∥2 , ∀w ∈ H. 2 This inequality combined with Lipschitz continuous assumption implies that for any w ∈ H the inequality ∥w − w∗ ∥ ≤ R∗ := 2G/α holds, and therefore we can simply set R = R∗ . We also note that this dependency can be resolved with a weaker assumption than Lipschitz continuity, which only depends on the gradient of loss function at origin. To this end, we define |ℓ′ (0, y)| = G. Using the fact that LD (w) is α-strongly, it is easy to verify that α ∥w ∥2 − G∥w ∥ ≤ 0, leading to ∥w ∥ ≤ R := 2 G and, therefore, we can simply set ∗ ∗ ∗ ∗ α 2 R = R∗ . We now use our analysis of Algorithm 1 to obtain a sample complexity analysis for learning smooth strongly convex problems with a bounded hypothesis class. To make it easier to parse, we only keep the dependency on the main parameters d, α, β, T , and ϵprior and hide the dependency on other constants in O(·) notation. Let w denote the output of Algorithm 1. By setting ε = 0.5 and letting c = O(τ ) to be an arbitrary small number, 90 Theorem 3.2 yields the following: Corollary 3.3 (Sample Complexity). Under the same conditions as Theorem 3.2, by running Algorithm 1 for minimizing LD (w) with a number of iterations (i.e., number of training examples) T , if it holds that, ( ( 4 T ≥ O dκ log 1 1 log log + log ϵprior ϵprior δ 1 )) where κ = β/α denotes the condition number of the loss function and d is the dimension of data, then with a probability 1 − δ, w attains a risk of O(ϵprior ), i.e., LD (w) ≤ (1 + c)ϵprior . As an example of a concrete problem that may be put into the setting of the present work is the regression problem with squared loss. It is easy to show that average square loss function is Lipschitz continuous with a Lipschitz constant β = λmax (X ⊤ X) which denotes the largest eigenvalue of matrix X ⊤ X where X is the data matrix. The strong convexity is guaranteed as long as the population data covariance matrix is not rank-deficient and its minimum eigenvalue is lower bounded by a constant α > 0. For this problem, the optimal minimax sample complexity is known to be O( 1ϵ ), but as it implies from Corollary 3.3, by the knowledge of target risk ϵprior , it is possible to reduce the sample complexity to O(log(1/ϵprior )). Remark 3.4. It is indeed remarkable that the sample complexity of Theorem 3.2 has κ4 = (β/α)4 dependency on the condition number of the loss function, which is worse than the √ β/α dependency in the lower bound in (3.1). Also, the explicit dependency of sample complexity on dimension d makes the proposed algorithm inappropriate for non-parametric settings. 91 3.4 Analysis of Sample Complexity Now we turn to proving the main theorem. The proof will be given in a series of lemmas and theorems where the proof of few are given in the Section 3.5. The proof makes use of the Bernstein inequality for martingales, idea of peeling process, self-bounding property of smooth loss functions, standard analysis of stochastic optimization, and novel ideas to derive the claimed sample complexity for the proposed algorithm. The proof of Theorem 3.2 is by induction and we start with the key step given in the following theorem. Theorem 3.5. Assume ϵprior ≥ ϵopt . 
For a fixed stage k, if ∥wk − w∗ ∥ ≤ ∆k , then, with a probability 1 − δ, we have ∥wk+1 − w∗ ∥2 ≤ a∆2k + bϵprior where a= [ √ ] s) √ 2 ( 2ξβ T1 + ξ 3 βd + 2ξβ d ln , αT1 δ b= 8 αξ (3.8) √ and s is given in (3.7), provided that ξ ≥ 16β/α and η = 1/(2ξβ T1 ) hold. Taking this statement as given for the moment, we proceed with the proof of Theorem 3.2, returning later to establish the claim stated in Theorem 3.5. Proof of Theorem 3.2. By setting a and b in (3.8) in Theorem 3.5 as a ≤ ε and b ≤ 2τ /β, we have ξ ≥ 4β/(ατ ) and T1 ≤ [ √ ] s) √ 2 ( 2ξβ T1 + ξ 3 βd + 2ξβ d ln αε δ 92 implying that { T1 ≥ 4 max } √ ξ 3 βd + 2ξβ d s 16ξ 2 β 2 . ln , 2 2 εα δ α ε Thus, using Theorem 3.5 and the definition of ξ and T1 , we have, with a probability 1 − δ, ∆2k+1 ≤ ε∆2k + 2τ ϵ . β prior After m stages, with a probability 1 − mδ, we have 2τ ∆2m+1 ≤ εm ∆21 + ϵprior β m−1 ∑ i=0 εi ≤ εm ∆21 + 2τ ϵ . β(1 − ε) prior By the β-smoothness of LD (w), it implies that LD (wm+1 ) − LD (w∗ ) ≤ β β m 2 τ ∥wm+1 − w∗ ∥2 ≤ ε ∆1 + ϵ , 2 2 1 − ε prior βR2 m τ ≤ ε + ϵ , 2 1 − ε prior where the last inequality follows from ∆1 ≤ R. The bound stated in the theorem follows the assumption that LD (w∗ ) = ϵopt ≤ ϵprior . We now turn to proving Theorem 3.5. To bound ∥wk+1 − w∗ ∥ in terms of ∆k , we start with the standard analysis of online learning. In particular, from the strong convexity 93 assumption of LD (w) and updating rule (3.5) we have, α LD (wkt ) − LD (w∗ ) ≤ ⟨∇LD (wkt ), wkt − w∗ ⟩ − ∥wkt − w∗ ∥2 2 α = ⟨vkt , wkt − w∗ ⟩ + ⟨∇LD (wkt ) − vkt , wkt − w∗ ⟩ − ∥wt − w∗ ∥2 2 t+1 t 2 2 ∥wk − w∗ ∥ − ∥wk − w∗ ∥ ηd ≤ + γk2 2η 2 α + ⟨∇LD (wkt ) − vkt , wkt − w∗ ⟩ − ∥wt − w∗ ∥2 , (3.9) 2 ≜v t k √ where the last step follows from ∥vkt ∥ ≤ γk d. By adding all the inequalities of (7.1) at stage k, we have T1 ∑ T LD (wkt ) − LD (w∗ ) t=1 T 1 1 ∑ ∥wk − w∗ ∥2 dη 2 α∑ t ≤ + γk T1 + vk − ∥wt − w∗ ∥2 2η 2 2 ≤ t=1 2 ∆k dη 2 α + γk T1 + Vk − Wk , 2η where Vk and Wk are defined as Vk = 2 t=1 (3.10) 2 ∑T 1 t t=1 vk and Wk = ∑T1 t 2 t=1 ∥wk − w∗ ∥ , respectively. In order to bound Vk , using the fact that ∇LD (wkt ) = Et [gkt ], we rewrite Vk as Vk = T1 ∑ ⟨−vkt + Et [vkt ], wkt − w∗ ⟩ + t=1 T1 ∑ t=1 ≜dt k [ ] ⟨Et gkt − Et [vkt ], wkt − w∗ ⟩ ≜et k = Dk + Ek , where Dk = ∑T 1 t t=1 dk and Ek = ∑T1 t t=1 ek which represent the variance and bias of the clipped gradient vkt , respectively. We now turn to separately upper bound each term. The following lemma bounds the variance term Dk using the Bernstein inequality for martingale. Its proof can be found in Section 3.5. 94 Lemma 3.6. For any L > 0 and µ > 0, we have ) ( ) ( ( √ ) s ϵprior T1 1 2 Pr Wk ≤ + Pr Dk ≤ Wk + Lγk d + γk ∆k d ln ≥1−δ 2µβ L δ where s is given by ⌈ ⌉ 8βµR2 s = log2 . ϵprior The following lemma bounds Ek using the self-bounding property of smooth functions and the proof is deferred to Section 3.5.2. Lemma 3.7. Ek ≤ 4T1 4β 4T 4β ϵopt + Wk ≤ 1 ϵprior + W . ξ ξ ξ ξ k Note that without the knowledge of ϵprior , we have to bound ϵopt by Ω(1), resulting in a very loose bound for the bias term Ek . It is knowledge of the target expected risk ϵprior that allows us to come up with a significantly more accurate bound for the bias term Ek , which consequentially leads to a geometric convergence rate. We now proceed to bound ∑T1 t t=1 LD (wk ) − LD (w∗ ) using the two bounds in Lemma 3.6 and 3.7. To this end, based on the result obtained in Lemma 3.6, we consider two scenarios. 
In the first scenario, we assume Wk ≤ ϵprior T1 2µβ (3.11) In this case, we have T1 ∑ LD (wkt ) − LD (w∗ ) ≤ t=1 95 ϵprior β Wk ≤ T . 2 2µ 1 (3.12) In the second scenario, we assume Dk ≤ ( √ ) s 1 WT + Lγk2 d + γk ∆k d ln . L δ (3.13) ξ In this case, by combining the bounds for Dk and Ek and setting L = 4β , we have ( ) √ 8β ξd 2 s 4T Vk ≤ Wk + γk + γk ∆k d ln + 1 ϵprior ξ 4β δ ξ ) ( √ 8β s 4T = Wk + ξ 3 βd + 2ξβ d ∆2k ln + 1 ϵprior , ξ δ ξ α where the last equality follows from the fact γk = 2ξβ∆k . If we choose ξ such that 8β ξ ≤ 2 or ξ ≥ 16β α > 1 holds, we get Vk ≤ ( √ ) α s 4T Wk + ξ 3 βd + 2ξβ d ∆2k ln + 1 ϵprior 2 δ ξ Substituting the above bound for Vk into the inequality of (3.10), we have T1 ∑ LD (wkt ) − LD (w∗ ) ≤ t=1 By choosing η as η = ( √ ) ∆2k η 2 s 4T + γk T1 + ξ 3 βd + 2ξβ d ∆2k ln + 1 ϵprior 2η 2 δ ξ ∆k 1 √ = √ , we have γk T 1 2ξβ T1 LD (wk+1 ) − LD (w∗ ) ≤ [ √ ] s) 2 4 √ 1 ( 2ξβ T1 + ξ 3 βd + 2ξβ d ln ∆k + ϵprior . T1 δ ξ (3.14) By combining the bounds in (3.12) and (3.14), under the assumption that at least one of the 96 two conditions in (3.11) and (3.13) is true, by setting µ = B/8, we have LD (wk+1 ) − LD (w∗ ) ≤ [ √ ] s) 2 4 √ 1 ( 2ξβ T1 + ξ 3 βd + 2ξβ d ln ∆k + ϵprior , T1 δ ξ implying ∥wk+1 − w∗ ∥ ≤ [ √ ] s) 2 √ 2 ( 8 2ξβ T1 + ξ 3 βd + 2ξβ d ln ∆k + ϵ . αT1 δ αξ prior We complete the proof by using Lemma 3.6, which states that the probability for either of the two conditions hold is no less than 1 − δ. 3.5 Proofs of Sample Complexity 3.5.1 Proof of Lemma 3.6 The proof is based on the Bernstein’s inequality for martingales which can be found in ⟨ ⟩ Lemma A.25. Define martingale difference dtk = wkt − w∗ , Et [vkt ] − vkt and martingale Dk = ∑T 1 t 2 t=1 dk . Let ΣT denote the conditional variance as Σ2T = T1 ∑ [ Et (dtk )2 ] ≤ t=1 ≤ T1 ∑ t=1 T ∑ [ Et ] 2 Et [vkt ] − vkt ∥wkt − w∗ ∥2 dγk2 ∥wkt − w∥2 = dγk2 Wk , t=1 which follows from the Cauchy’s Inequality and the definition of clipping. √ Define M = max |dtk | ≤ 2 dγk ∆k . To prove the inequality in Lemma 3.6, we follow the t 97 idea of peeling process [92]. Since Wk ≤ 4R2 T1 , we have ( ) √ √ Pr Dk ≥ 2γk Wk dρ + 2M ρ/3 ) ( √ √ 2 = Pr Dk ≥ 2γk Wk dρ + 2M ρ/3, Wk ≤ 4R T1 ) ( √ √ = Pr Dk ≥ 2γk Wk dρ + 2M ρ/3, Σ2T ≤ γk2 dWk , Wk ≤ 4R2 T1 ( ) √ √ ≤ Pr Dk ≥ 2γk Wk dρ + 2M ρ/3, Σ2T ≤ γk2 dWk , Wk ≤ ϵprior T1 /(2βµ) ( ) s i−1 T iT ∑ √ √ ϵ 2 ϵ 2 1 1 prior prior + Pr Dk ≥ 2γk Wk dρ + 2M ρ/3, Σ2T ≤ γk2 dWk , < Wk ≤ 2βµ 2βµ i=1 ) ( ϵprior T1 ≤ Pr Wk ≤ 2βµ   √ √ s 2 i 2 i+1 ∑ ϵprior 2 T1 γk d ϵprior 2 T1 γk d 2  + Pr Dk ≥ ρ+ M ρ, Σ2T ≤ 2βµ 3 2βµ i=1 ( ) ϵprior T1 ≤ Pr Wk ≤ + se−ρ , 2βµ where s is given by ⌉ ⌈ 8βµR2 . s = log2 ϵprior The last step follows the Bernstein inequality for martingales. We complete the proof by setting ρ = ln(s/δ) and using the fact that 2γk 3.5.2 √ 1 Wk ρd ≤ Wk + γk2 ρdL. L Proof of Lemma 3.7 To bound Ek , we need the following two lemmas. The first lemma bounds the deviation of the expected value of a clipped random variable from the original variable, in terms of its variance (Lemma A.2 from [74]). 98 Lemma 3.8. Let X be a random variable, let X = clip(X, C) and assume that |E[X]| ≤ C/2 for some C > 0. Then |E[X] − E[X]| ≤ 2 |Var[X]| C Another key observation used for bounding Ek is the fact that for any non-negative βsmooth convex function, we have the following self-bounding property. We note that this self-bounding property has been used in [138] to get better (optimistic) rates of convergence for non-negative smooth losses. Lemma 3.9. 
For any β-smooth non-negative function f : R → R, we have |f ′ (w)| ≤ √ 4βf (w) Proof. See Appendix ?? Proof of Lemma 3.7. To apply the above lemmas, we write etk as etk = d ∑ )] [ ( Et ℓ′ (⟨wkt , xtk ⟩, yt )[xtk ]i − clip γk , ℓ′ (⟨wkt , xtk ⟩, yt )[xtk ]i [wkt − w∗ ]i i=1 In order to apply Lemma 3.8, we check if the following condition holds ] [ ( ) γk ≥ 2 Et ℓ′ ⟨wkt , xtk ⟩, yt [xtk ]i (3.15) Since ≤ ) ] [ ( Et ℓ′ ⟨wkt , xtk ⟩, yt [xtk ]i [ ( ) ] ( )} ] ) [{ ( Et ℓ′ ⟨wkt , xtk ⟩, yt − ℓ′ ⟨w∗ , xtk ⟩, yt [xtk ]i + Et ℓ′ ⟨w∗ , xtk ⟩, yt [xtk ]i ≤ β∥wkt − w∗ ∥ ≤ β∆k 99 [ ( ) ] where the last inequality follows from Et ℓ′ ⟨w∗ , xtk ⟩, yt [xtk ]i = 0 since w∗ is the minimizer of LD (w), we thus have ) ] [ ( γk = 2ξβ∆k ≥ 2β∆k ≥ 2 Et ℓ′ ⟨wkt , xtk ⟩, yt [xtk ]i where ξ ≥ 1, implying that the condition in (3.15) holds. Thus, using Lemma 3.8, we have etk ≤ ≤ [ ( )2 1 [wkt − w∗ ]i Et ℓ′ (⟨wkt , xtk ⟩, yt )[xtk ]i γk i=1 [( )2 ] 2∥wkt − w∗ ∥∞ ′ t t Et ℓ (⟨wk , xk ⟩, yt ) γk d ∑ ] Using Lemma 3.9 to upper bound the right hand side, we further simplify the above bound for etk as )] 8β∥wkt − w∗ ∥∞ [ ( t t Et ℓ ⟨wk , xk ⟩, yt γk t 8β∥wk − w∗ ∥∞ LD (wkt ) = γk 8β∆k ≤ LD (wkt ) γk 4 L (wt ) = ξ D k etk ≤ where the second inequality follows from ∥wkt − w∗ ∥∞ ≤ ∥wkt − w∗ ∥ ≤ ∆k . Therefore we 100 obtain Ek = T1 ∑ t=1 4 etk ≤ ξ T1 ∑ LD (wkt ) t=1 T T t=1 t=1 T1 1 1 4∑ 4∑ = LD (w∗ ) + LD (wkt ) − LD (w∗ ) ξ ξ ≤ 4T1 4β ∑ t ∥wk − w∗ ∥2 LD (w∗ ) + ξ ξ t=1 4T1 4β = LD (w∗ ) + W , ξ ξ k where the second inequality follows from the smoothness assumption of LD (w). 3.6 Summary In this chapter, we have studied the sample complexity of passive learning when the target expected risk is given to the learner as prior knowledge. The crucial fact about target risk assumption is that, it can be fully exploited by the learning algorithm and stands in contrast to most common types of prior knowledges that usually enter into the generalization bounds and are often perceived as a rather crude way to incorporate such assumptions. We showed that by explicitly employing the target risk ϵprior in a properly designed stochastic optimization algorithm, it is possible to attain the given target risk ϵprior with a logarithmic ( ) 1 sample complexity log ϵ , under the assumption that the loss function is both strongly prior convex and smooth. There are various directions for future research. The current study is restricted to the parametric setting where the hypothesis space is of finite dimension. It would be interesting to see how to achieve a logarithmic sample complexity in a non-parametric setting where hypotheses lie in a functional space of infinite dimension. Evidently, it is impossible to extend 101 the current algorithm for the non-parametric setting; therefore additional analysis tools are needed to address the challenge of infinite dimension arising from the non-parametric setting. It is also an interesting problem to relate target risk assumption we made here to the low noise margin condition which is often made in active learning for binary classification since both settings appear to share the same sample complexity. However it is currently unclear how to derive a connection between these two settings. We believe this issue is worthy of further exploration and leave it as an open problem. 
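To make the stage-wise scheme analyzed in this chapter concrete, the following Python sketch simulates a multi-stage stochastic gradient method with per-coordinate gradient clipping, in the spirit of Algorithm 1. The problem instance (least squares with a well-conditioned design), the number of stages, the halving of ∆k after each stage, and the use of the within-stage average to start the next stage are illustrative assumptions rather than the exact prescriptions of Theorem 3.2.

import numpy as np

rng = np.random.default_rng(0)

# Synthetic strongly convex and smooth problem: least squares with a well-conditioned design.
d, n = 5, 10000
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

beta = np.linalg.eigvalsh(X.T @ X / n).max()   # smoothness (largest eigenvalue of the covariance)
alpha = np.linalg.eigvalsh(X.T @ X / n).min()  # strong convexity (smallest eigenvalue)

def clipped_sgd(num_stages=6, T1=2000, xi=None, Delta1=4.0):
    """Multi-stage SGD with per-coordinate gradient clipping (illustrative sketch)."""
    xi = xi if xi is not None else 16 * beta / alpha   # xi >= 16 beta / alpha, as in Theorem 3.5
    w = np.zeros(d)
    Delta = Delta1
    for k in range(num_stages):
        gamma_k = 2 * xi * beta * Delta            # clipping threshold for stage k
        eta = 1.0 / (2 * xi * beta * np.sqrt(T1))  # fixed step size within the stage
        w_sum = np.zeros(d)
        for t in range(T1):
            i = rng.integers(n)
            g = (X[i] @ w - y[i]) * X[i]           # stochastic gradient of the squared loss
            v = np.clip(g, -gamma_k, gamma_k)      # per-coordinate clipping
            w = w - eta * v
            w_sum += w
        w = w_sum / T1                             # stage average starts the next stage
        Delta /= 2                                 # shrink the assumed distance to w_*
        risk = 0.5 * np.mean((X @ w - y) ** 2)
        print(f"stage {k + 1}: risk = {risk:.4f}")
    return w

clipped_sgd()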
3.7 Bibliographic Notes Sample complexity of passive learning is well established and goes back to early works in ( ) ( ) the learning theory where the lower bounds Ω 1ϵ (log 1ϵ + log 1δ ) and Ω 12 (log 1ϵ + log 1δ ) ϵ were obtained in classic PAC and general agnostic PAC settings, respectively [52, 26, 7]. There has been an upsurge of interest over the last decade in finding tight upper bounds on the sample complexity by utilizing prior knowledge on the analytical properties of the loss function, that led to stronger generalization bounds in agnostic PAC setting. In [95] fast rates obtained for squared loss, exploiting the strong convexity of this loss function, which only holds under pseudo-dimensionality assumption. With the recent development in online strongly convex optimization [68], fast rates approaching O( 1ϵ log 1δ ) for convex Lipschitz strongly convex loss functions has been obtained in [140, 82]. For smooth non-negative loss functions, [138] improved the sample complexity to optimistic rates ( ( )( )) 1 ϵopt + ϵ 1 1 3 O + log log ϵ ϵ ϵ δ 102 for non-parametric learning using the notion of local Rademacher complexity [13], where ϵopt is the optimal risk. The proposed ClippedSGD algorithm is related to the recent studies that examined the learnability from the viewpoint of stochastic convex optimization. In [139, 133], the authors presented learning problems that are learnable by stochastic convex optimization but not by empirical risk minimization (ERM). Our work follows this line of research. The proposed algorithm achieves the sample complexity of O(d log(1/ϵprior )) by explicitly incorporating the target expected risk ϵprior into the stochastic convex optimization algorithm. It is however difficult to incorporate such knowledge into the framework of ERM. Furthermore, it is worth noting that in [127, 139, 126, 20], the authors explored the connection between online optimization and statistical learning in the opposite direction. This was done by exploring the complexity measures developed in statistical learning for the learnability of online learning. We note that our work does not contradict the lower bound in [138] because a feasible target risk ϵprior is given in our learning setup and is fully exploited by the proposed algorithm. Knowing that the target risk ϵprior is feasible makes it possible to improve the sample complexity from O(1/ϵprior ) to O(log(1/ϵprior )). We also note that although the logarithmic sample complexity is known for active learning [66, 12], we are unaware of any existing passive learning algorithm that is able to achieve a logarithmic sample complexity by incorporating any kind of prior knowledge. The proposed algorithm is also closely related to the recent works that stated O(1/n) is the optimal convergence rate for stochastic optimization when the objective function is strongly convex [76, 72, 125]. In contrast, the proposed algorithm is able to achieve a geometric convergence rate for a target optimization error. Similar to the previous argument, our result does not contradict the lower bound given in [72] because of the knowledge of a 103 feasible optimization error. Moreover, in contrast to the multistage algorithm in [72] where the size of stages increases exponentially, in our algorithm, the size of each stage is fixed to be a constant. 
104 Chapter 4 Statistical Consistency of Smoothed Hinge Loss In Chapter 2 we discussed that convex surrogates of the 0-1 loss are highly preferred because of the computational and theoretical virtues that convexity brings in and most prominent practical methods studied in machine learning make significant use of convexity. This is of more importance if we consider smooth surrogates as witnessed by the fact that the smoothness is further beneficial both computationally- by attaining an optimal convergence rate for optimization, and in a statistical sense- by providing an improved optimistic rate for generalization bound. This chapter concerns itself with the statistical consistency of smooth convex surrogates. The statistical consistency finds general quantitative relationships between the excess risk errors associated with convex and those associated with 0-1 loss. Consistency results provide reassurance that optimizing a surrogate does not ultimately hinder the search for a function that achieves the binary excess risk, and thus allow such a search to proceed within the scope of computationally efficient algorithms. Statistical consistency of surrogates under conditions such as convexity is a well studied problem in learning community and quantitative relationships between binary risk and convex excess risk has been established. In this chapter we investigate the smoothness property from the viewpoint of statistical consistency and 105 show how it affects the binary excess risk. We show that in contrast to optimization and generalization errors that favor the choice of smooth surrogate loss, the smoothness of loss function may degrade the binary excess risk. Motivated by this negative result, we provide a unified analysis that integrates optimization error, generalization bound, and the error in translating convex excess risk into a binary excess risk when examining the impact of smoothness on the binary excess risk. We show that under favorable conditions appropriate √ choice of smooth convex loss will result in a binary excess risk that is better than O(1/ n). The reminder of this paper is organized as follows. In Section 4.1 we set up notation and describe the setting. Section 4.2 briefly discusses the classification-calibrated convex surrogate losses on which our analysis relies. We derive the ψ-transform for smoothed hinge loss and elaborate its binary excess risk in Section 4.3. Section 4.4 provides a unified analysis of three types of errors and derives conditions in terms of smoothness to obtain better rates for the binary excess risk. The omitted proofs are included in Section 4.5. Section 4.6 concludes the paper. 4.1 Motivation Let S = ((x1 , y1 ), (x2 , y2 ), · · · , (xn , yn )) be a set of i.i.d. samples drawn from an unknown distribution D over Ξ = X × {−1, +1}, where xi ∈ X ⊆ Rd is an instance and yi ∈ {−1, +1} is the binary class assignment for xi . Let κ(·, ·) be an universal kernel and let Hκ be the Reproducing Kernel Hilbert Space (RKHS) endowed with kernel κ(·, ·). According to [157], Hκ is a rich function space whose closure includes all the smooth functions. We consider predictors from Hκ with bounded norm to form the measurable function class H = {h ∈ Hκ : ∥h∥Hκ ≤ B}. 106 For a function h : X → R, the risk of h is defined as: LD (h) = E(x,y)∼D [I[yh(x) ≤ 0]] = P [yh(x) ≤ 0] . Let h∗ be the optimal classifier that attains the minimum risk, i.e. h∗ = arg min P [yh(x) ≤ 0] . h We assume h∗ ∈ Hκ with ∥h∗ ∥Hκ ≤ B. 
This boundedness condition is satisfied for any RKHS with a bounded kernel (i.e. supx∈X κ(x, x) ≤ B). Henceforth, let L∗D stand for the minimum achievable risk by the optimal classifier h∗ , i.e., L∗D = LD (h∗ ). Define the binary excess risk for a prediction function h ∈ H as E(h) = LD (h) − L∗D . Our goal is to efficiently learn a prediction function h ∈ H from the training examples in S that minimizes the binary excess risk E(h). Many studies of binary excess risk assume that the optimal classifier h ∈ H is learned by minimizing the empirical binary risk, ∑ minh∈H n1 n i=1 I[yi h(xi ) ≤ 0], an approach that is usually referred to as Empirical Risk Minimization (ERM) [145]. To understand the generalization performance of the classifier learned by ERM, it is important to have upper bounds on the excess risk of the empirical minimizer that hold with a high probability and that take into account complexity measures of classification functions. It is well known that, under certain conditions, direct empirical classification error minimization is consistent [145] and achieves a fast convergence rate under low noise situations [109]. 107 One shortcoming of the ERM based approaches is that they need to minimize 0-1 loss, leading to non-convex optimization problems that are potentially NP-hard 1 [8, 75]. A common practice to circumvent this difficulty is to replace the indicator function I[· ≤ 0] with some convex loss ϕ(·) and find the optimal solution by minimizing the convex surrogate loss. Examples of such surrogate loss functions for 0-1 loss include logit loss ϕlog (h; (x, y)) = log(1+exp(−yh(x))) in logistic regression [61], hinge loss ϕHinge (h; (x, y)) = max(0, 1−yh(x)) in support vector machine (SVM) [43] and exponential loss ϕexp (h; (x, y)) = exp(−yh(x)) in AdaBoost [58]. Given a convex surrogate loss function ϕ : R → R+ (e.g., hinge loss, exponential loss, or logistic loss) we define the risk with respect to the convex loss ϕ (convex risk or ϕ-risk) as ϕ LD (h) = E(x,y)∼D [ϕ(yh(x))]. ϕ,∗ Similarly we define the optimal ϕ-risk as LD = inf h∈H E(x,y)∼D [ϕ(yh(x))]. The excess ϕ-risk or convex excess risk of a classifier h ∈ H with respect to the convex surrogate loss ϕ(·) is defined as ϕ ϕ,∗ Eϕ (h) = LD (h) − LD . An important line of research in statistical learning theory focused on relating the convex excess risk Eϕ (h) to the binary excess risk E(h) that will be elaborated in next section. It is known that under mild conditions, the classifier learned by minimizing the empirical loss of convex surrogate is consistent to the Bayes classifier [156, 101, 80, 97, 141, 14]. For 1 We note that several works [84, 85] provide efficient algorithms for direct 0-1 empirical error minimization but under strong (unrealistic) assumptions on data distribution or label generation. 108 instance, it was shown in [14] that the necessary and sufficient condition for a convex loss ϕ(·) to be consistent with the binary loss is that ϕ(·) is differentiable at origin and ϕ′ (0) < 0. It was further established in the same work that the binary excessive risk can be upper bound by the convex excess risk through a ψ-transform that depends on the surrogate convex loss ϕ(·). Since the choice of convex surrogates could significantly affect the binary excess risk, in this chapter, we will investigate the impact of the smoothness of a convex loss function on the binary excess risk. 
This is motivated by the recent results that show the advantages of using smooth convex surrogates in reducing the optimization complexity and the generalization error bound. More specifically, [121, 143] show that a faster convergence rate (i.e., O(1/T 2 )) can be achieved by first order methods when the objective function to be optimized is convex and smooth such as accelerated gradient descent method introduced in Chapter 2; in [138], the authors show that a smooth convex loss will lead to a better optimistic generalization error bound rooted in the self-bounding property of smooth losses (Lemma A.12). Given the positive news of using smooth convex surrogates, an open research question is how the smoothness of a convex surrogate will affect the binary excess risk. The answer to this question, as will be revealed later, is negative: the smoother the convex loss, the poorer approximation will be for the binary excess risk. Thus, the second contribution of this work is to integrate these results for smooth convex losses, and examine the overall effect of replacing 0-1 loss with a smooth convex loss when taking into account three sources of errors, i.e. the optimization error, the generalization error, and the error in translating the convex excess risk into the binary risk. As we will show, under favorable conditions, appropriate √ choice of smooth convex loss will result a binary excess risk better than O(1/ n). 109 4.2 Classification Calibration and Surrogate Risk Bounds Although it is computationally convenient to minimize the empirical risk based on a convex surrogate, the ultimate goal of any classification method is to find a function h ∈ Hκ that minimizes the binary loss. Therefore, it is crucial to investigate the conditions which ϕ,∗ guarantee that if the ϕ-risk of h gets close to the optimal LD , the binary risk of h will also approach the optimal binary risk L∗D . This question has been an active trend in statistical learning theory over the last decade where the necessary and sufficient conditions have been established for relating the binary excess risk to a convex excess risk [156, 101, 80, 97, 141, 14]. In this chapter we follow the strategy introduced in [14] in order to relate the binary excess risk to the excess ϕ-risk. Their methodology, through the notion of classification calibration, allows us to find quantitative relationship between the excess risk associated with ϕ and the excess risk associated with 0-1 loss. It is established in [14] that the binary excessive risk can be bounded by the convex excess risk, based on the convex loss function ϕ, through a ψ-transform. Definition 4.1. Given a loss function ϕ : R → [0, ∞), define the function ψ : [0, 1] → [0, ∞) by ˜ ψ(z) = H− ( 1+z 2 ) ( −H 1+z 2 ) where H − (η) = inf α:α(2η−1)≤0 (ηϕ(α) + (1 − η)ϕ(−α)) and H(η) = inf (ηϕ(α) + (1 − η)ϕ(−α)) . α∈R ˜ The transform function ψ : [0, 1] → [0, ∞) is defined to be the convex closure of ψ. 110 The following theorem from [14, Theorem 1] shows that the binary excess risk can be bounded by the convex excess risk using transform function ψ : [0, 1] → [0, ∞) that depends on the surrogate convex loss function. Theorem 4.2. For any non-negative loss function ϕ(·), any measurable function h ∈ H, and any probability distribution D on X × Y, there is a nondecreasing function ψ : [0, 1] → [0, ∞) that ψ(LD (h) − L∗D ) ≤ LD (h) − LD ϕ ϕ,∗ (4.1) holds. Here the minimization is taken over all measurable functions. Definition 4.3. 
A convex loss ϕ is classification-calibrated if, for any η ̸= 1/2, H − (η) > H(η). This condition is essentially an extension of [156, Theorem 2.1] and can be viewed as a form of Fisher consistency that is appropriate for classification. It has been shown in [14] that the necessary and sufficient condition for a convex loss ϕ(z) to be classification-calibrated is if it is differentiable at the origin and ϕ′ (0) < 0. In particular, for a certain convex function ϕ(·), the ψ-transform can be computed by ( ψ(z) = inf αz≤0 ) ( ) 1−z 1+z 1−z 1+z ϕ(α) + ϕ(−α) − inf ϕ(α) + ϕ(−α) , 2 2 2 2 α∈R ( ) that can be further simplified as ψ(z) = ϕ(0) − H 1+z when ϕ is classification-calibrated. 2 Examples of ψ-transform for the convex surrogate functions of known practical algorithms 111 mentioned before are as follows: (i) for hinge loss ϕ(α) = max(0, 1 − α) , ψ(z) = |z|, (ii) for √ exponential loss ϕ(α) = e−α , ψ(z) = 1 − 1 − z 2 ≥ z 2 /2, and (iii) for truncated quadratic loss ϕ(α) = [max(0, 1 − α)]2 , ϕ(z) = z 2 . Remark 4.4. We note that the inequality in (4.1) provides insufficient guidance on choosing appropriate loss function. A few brief comments are appropriate. First, it does not measure ϕ ϕ,∗ explicitly how the choice of the convex surrogate ϕ(·) affects the excess risk LD (h) − LD . Second, it does not take into account the impact of loss function on optimization efficiency, an important issue for practitioners when dealing with big data. It is thus unclear, from Theorem 4.2, how to choose an appropriate loss function that could result in a small generalization error for the binary loss when the computational time is limited. In this chapter, we address these limitations by examining a family of convex losses that are constructed by smoothing the hinge loss function using different smoothing parameters. We study the binary excessive risk of the learned classification function by taking into account errors in optimization, generalization, and translation of convex excess risk into binary excess risk. 4.3 Binary Excess Risk for Smoothed Hinge Loss As stated before, to efficiently learn a prediction function h ∈ H, we will replace the binary loss with a smooth convex loss. Since hinge loss is one of the most popular loss functions used in machine learning and is the loss of choice for classification problems in terms of the margin error [19], in this work, we will focus on the smoothed version of the hinge loss. Another advantage of using the hinge loss is that its ψ-transform is a linear function. Compared with the ψ-transforms of other popular convex loss functions (e.g. exponential loss and truncated square loss) that are mostly quadratic, using the hinge loss as convex surrogate will lead to 112 a tighter bound for the binary excess risk. The smoothed hinge loss considered in this chapter is defined as 1 ϕ(z; γ) = max α(1 − z) + R(α), γ α∈[0,1] (4.2) where R(α) = −α log α − (1 − α) log(1 − α) and γ > 0 is the smoothing parameter. It is straightforward to verify that the loss function in (4.2) can be simplified as ϕ(z; γ) = 1 log(1 + exp(γ(1 − z))). γ It is not immediately clear from Theorem 4.2 how the relationship between smooth convex excess risk Eϕ (·) and binary excess risk is affected by the smoothness parameter γ. 
In addition, as discussed in [14], whereas conditions such as convexity and smoothness have natural relationship to optimization and generalization, it is not immediately obvious how properties such as convexity and smoothness of convex surrogate relates to statistical consequences. In what follows, we show that, indeed smoothness of loss function has a negative statistical consequence and can degrade the binary excess risk. 4.3.1 ψ-Transform for Smoothed Hinge Loss The first step in our analysis is to derive the ψ-transform for the loss function defined in (4.2) as stated in the following theorem. Theorem 4.5. The ψ-transform of smoothed hinge loss with smoothing parameter γ is given 113 by 1+η ψ(η; γ) = − log 2γ ( [ ]) ( [ ]) 1 1−η 1 C1 C2 γ γ 1+e − log 1+e 1 + eγ 1+η 2γ 1 + eγ 1−η where C1 and C2 are defined as C1 = −ηeγ + √ √ η 2 e2γ + 1 − η 2 and C2 = ηeγ + η 2 e2γ + 1 − η 2 . The ψ-transform given in Theorem 4.5 is too complicated to be useful. The theorem below provides a simpler bound for the ψ-transform in terms of the smoothness parameter γ. Theorem 4.6. For η ∈ (−1, 1), we have ψ(η; γ) ≥ |η| − 1 1 log . γ |η| Remark 4.7. The bound obtained in Theorem 4.6 demonstrates that when γ approaches to infinity, the ψ-transform for smoothed hinge loss ϕ(η; γ) becomes |η|. According to [14], the ψ-transform for the hinge loss is ψ(η) = |η|. Therefore, this result is consistent with the ψ-transform for smoothed hinge loss, which is the limit of ϕ(z; γ) as γ approaches infinity. 4.3.2 Bounding E(h) based on Eϕ (h) Based on the transform function ψ(·; γ) that is computed for smoothed hinge loss with smoothing parameter γ, we are now in the position to bound its corresponding binary excess risk E(h). Our main result in this section is the following theorem that shows how binary excess risk can be bounded by the excess ϕ-risk for smoothed hinge loss. Theorem 4.8. Consider any measurable function h ∈ H and the smoothed hinge loss ϕ(·) with parameter γ defined in (4.2). Then, binary excess risk E(h) can be bounded by the 114 smooth convex excess risk Eϕ (h) as E(h) ≤ Eϕ (h) + Eϕ (h) 1 log . 1 + γEϕ (h) Eϕ (h) Proof. Using the result from Theorem 4.2, we have Eϕ (h) ≥ ψ(E(h); γ) and therefore an immediate result from the ψ-transform for smoothed hinge loss that is obtained in Theorem 4.6 indicates E(h) + 1 log E(h) ≤ Eϕ (h). γ Define ∆ = E(h) − Eϕ (h). We have ( ) 1 1 ∆ 1 ≤ 0. ∆ + log(∆ + Eϕ (h)) = ∆ + log Eϕ (h) + log 1 + γ γ γ Eϕ (h) Based on the log(1 + x) ≤ x inequality, the sufficient condition for the above inequality to hold is to have ∆+ ∆ 1 1 ≤ log γEϕ (h) γ Eϕ (h) and therefore ∆≤ Eϕ (h) γ −1 1 1 log = log . Eϕ (h) 1 + γEϕ (h) Eϕ (h) 1 + (γEϕ (h))−1 The final bound is obtained by substituting E(h) − Eϕ (h) for ∆ in the left hand side of above inequality. As indicated by Theorem 4.8, the smaller the smoothing parameter γ, the poorer the approximation is in bounding the binary excess E(h) with smooth convex excess risk Eϕ (h). On the other hand, the smoothness of loss function has been proven to be beneficial in terms 115 of optimization error and generalization bound. The mixture of negative and positive results for using smooth convex surrogates motivates us to develop an integrated bound for binary excess risk that takes into account all types of errors. One of the main contributions of this work is to show that under favorable conditions, with appropriate choice of smoothing parameter, the smoothed hinge loss will result in a bound for the binary excess risk better √ than O(1/ n). 
4.4 A Unified Analysis Using the smoothed hinge loss, we define the convex loss for a prediction function h ∈ H as LD (h) = E[ϕ(yh(x); γ)]. Let h∗γ be the optimal classifier that minimizes LD (h). Similar to ϕ ϕ the case of binary loss, we assume h∗γ ∈ Hκ with ∥h∗γ ∥ ≤ B. The smooth convex excess risk for a given prediction function h ∈ H is then given by Eϕ (h) = LD (h) − LD (h∗γ ). Given the ϕ ϕ smooth convex loss ϕ(z; γ) in (4.2), we find the optimal classifier by minimizing the empirical ϕ ϕ convex loss, i.e. minh∈Hκ ,∥h∥ ≤B LS (h), where the empirical convex loss LS (h) is given Hκ by 1∑ ϕ(yi h(xi ); γ). n n ϕ LS (h) = (4.3) i=1 Let h be the solution learned from solving the empirical convex loss over training examples. There are three sources of errors that affect bounding the binary excess risk E(h). First, since h is obtained by numerically solving an optimization problem, the error in estimating the optimal solution, which we refer to as optimization error 2 , will affect E(h). Additionally, 2 We note that in literature the error in estimating the optimal solution for empirical minimization is usually referred to as estimation error. We emphasize it as optimization error because different convex surrogates could lead to very different iteration complexities and consequentially different optimization efficiency. 116 since the binary excess risk can be bounded by a nonlinear transform of the convex excess risk, both the bound for Eϕ (h) and the error in approximating E(h) with Eϕ (h) will affect the final estimation of E(h). We aim at investigating how the smoothing parameter γ affect all these three types of errors. As it is investigated in Theorem 4.8, a smaller smoothing parameter γ will result in a poorer approximation of E(h). On the other hand, a smaller smoothing parameter γ will result in a smaller estimation error and a smaller bound for Eϕ (h). Based on the understanding of how smoothing parameter γ affects the three errors, we identify the choice of γ that results in the best tradeoff between all three error and √ consequentially a binary excess risk E(h) better than O(1/ n). To investigate how the smoothing parameter γ affects the binary excess risk E(h), we intend to unify three types of errors. The analysis is comprised of two components, i.e. bounding the binary excess risk E(h) by a smooth convex excess risk Eϕ (h) that has been established in Theorem 4.8 and bounding Eϕ (h) for a solution h that is suboptimal in miniϕ mizing the empirical convex loss LS (h) that is the focus of this section. 4.4.1 Bounding Smooth Excess Convex Risk Eϕ (h) We now turn to bounding the excess ϕ-risk Eϕ (h) for the smoothed hinge loss. To bound Eϕ (h) we need to consider two types of errors: optimization error due to the approximate optimization of the empirical ϕ-risk, and the generalization error bound for the empirical risk minimizer. After obtaining these two errors for smooth convex surrogates, we provide a unified bound on the excess ϕ-risk Eϕ (h) of empirical convex risk minimizer in terms of n. We begin by bounding the error arising from solving the optimization problem numerically. One nice property of smoothed hinge loss function is that both its first order and 117 second order derivatives are bounded, i.e. |ϕ′ (z; γ)| = exp(γ(1 − z) ≤ 1, 1 + exp(γ(1 − z)) ϕ′′ (z; γ) = γ exp(γ(1 − z)) γ ≤ . 
2 4 (1 + exp(γ(1 − z))) Due to the smoothness of ϕ(z; γ), we can apply the accelerated optimization algorithm [121, 143] to achieve an O(1/k 2 ) convergence rate for the optimization, where k is the number of iterations the optimization algorithm proceeds (see e.g., accelerated gradient descent algorithm in Subsection 2.3.2.2). More specifically, we will apply Algorithm 1 from [143] to solve the numerical optimization problem in (4.3) over the convex domain H = {h ∈ Hκ : ∥h∥Hκ ≤ B} which results in the following updating rules at sth iteration: gs = (1 − θs )hs + θs fs ) ( θs ϕ fs+1 = arg min ⟨∇LS (gs ), f − gs ⟩ + ∥f − fs ∥Hκ 2 f ∈H (4.4) hs+1 = (1 − θs )hs + θs fs+1 . The following theorem that follows immediately from [143, Corollary 1] and the fact ϕ′′ (z; γ) ≤ γ/4, bounds the optimization error for the optimization problem after k iterations. Lemma 4.9. Let h = hk+1 be the solution obtained by running accelerated gradient descent method (i.e., updating rules in (4.4)) to solve the optimization problem in (4.3) after k iterations with θ0 = 1 and θk = 2/(k + 2) for k ≥ 1. We have ϕ LS (h) ≤ min ϕ ∥h∥H ≤B κ LS (h) + γB 2 . (k + 2)2 We now turn to understanding the generalization error for the smooth convex loss. There are many theoretical results giving upper bounds of the generalization error. However, a 118 recent result [138] has showed that it is possible to obtain optimistic rates for generalization bound of smooth convex loss (in the sense that smooth losses yield better generalization bounds when the problem is easier), which are more appealing than the generalization of simple Lipschitz continuous losses. The following theorem from [138, Theorem 1] bounds the generalization error for any solution h ∈ H when the learning has been performed by a smooth convex surrogate ϕ(·). Lemma 4.10. With a probability 1 − δ, for any ∥h∥Hκ ≤ B, we have ( ) 2 )t (B + γB ϕ ϕ ϕ LD (h) − LS (h) ≤ K1 LS (h) n ) ( √ 2 )t 2 )t (B + γB (B + γB ϕ ϕ ϕ LD (h) − LS (h) ≤ K2 + LD (h) . n n (B + γB 2 )t + n √ where t = log(1/δ) + log3 n and K1 and K2 are universal constants. ˜ √n) The bound stated in this lemma is optimistic in the sense that it reduces to O(1/ ˜ when the problem is difficult and be better when the problem is easier, approaching O(1/n) ϕ,∗ for linearly separable data, i.e., LD = 0 in the second inequality. These two lemmas essentially enable us to transform a bound on the optimization error and generalization bound into a bound on the convex excess risk. In particular, by combining Lemma 4.9 with Lemma 4.10, we have the following theorem that bounds the smooth convex excess risk Eϕ (h) = LD (h) − LD (h∗λ ) for the empirical convex risk minimizer. ϕ ϕ Theorem 4.11. Let h be the solution output from updating rules in (4.4) after k iterations. Then, with a probability at least 1 − δ, we have γB 2 Eϕ (h) ≤ +K (k + 2)2 ( (B + γB 2 )t + n √ 2 ϕ,∗ (B + γB )t LD + n 119 √ γB 2 (B + γB 2 )t (k + 2)2 n ) ϕ,∗ ϕ where K is a universal constant, t = log(1/δ) + log3 n, and LD = min∥h∥ ≤B LD (h). Hκ Since our overall interest is to understand how the smoothing parameter γ affects the convergence rate of excess risk in terms of n, the number of training examples, it is better to parametrize both the number of iterations k and smoothing parameter γ in n, and bound the Eϕ (h) only in terms of n. This is given in the following corollary. Corollary 4.12. Assume γ ≥ 1 and B ≥ 1. Paramertize k and γ in terms of n as k + 2 = nα/2 and γ = nβ . 
Then, with a probability at least 1 − δ, ) ( ϕ,∗ Eϕ (h) ≤ C(B, t) nβ−α + nβ−1 + nβ−(α+1)/2 + [LD ]1/2 n(β−1)/2 (4.5) where C(B, t) is a constant depending on both B and t with t = log(1/δ) + log3 n. ϕ,∗ ϕ,∗ The bound given in (4.5) depends on LD . We would like to further characterize LD in terms of γ. First, we have 1 max max(0, 1 − z) + R(α) γ α∈[0,1] 1 log 2 ≤ max max(0, 1 − z) + log 2 = ϕHinge (z) + , γ γ α∈[0,1] ϕ(z; γ) = where ϕHinge (z) = max(0, 1 − z) is the hinge loss. As a result, we have ϕ,∗ Hinge,∗ LD ≤ LD Hinge,∗ where LD = min ∥h∥H ≤B κ + log 2 γ [ ] E(x,y)∼D ϕHinge (yh(x)) is the optimal risk with respect to the 120 hinge loss. In general, we will assume ϕ,∗ Hinge,∗ LD ≤ LD a + 1+ξ γ (4.6) ϕ,∗ Hinge,∗ where a > 0 is a constant and ξ ≥ 0 characterizes how fast LD will converge to LD with increasing γ. To see why the assumption in (4.6) is sensible, consider the case when the optimal classifier h∗Hinge = arg min∥h∥ ≤B LD Hκ Hinge (h) can perfectly classify all the data points with margin ϵ, in which we have ϕ,∗ Hinge,∗ LD ≤ LD +O ( −ϵγ ) e γ which satisfy the condition in (4.6) with arbitrarily large ξ. It is easy to verify that the condition (4.6) holds with ξ > 0 if h∗Hinge can perfectly classify O(1 − γ −1−ξ ) percentage of data with margin ϵ. Using the assumption in (4.6), we have the following result that characterizes the smooth Hinge,∗ convex excess risk bound Eϕ (h) stated in terms of the parameters α, δ and LD Theorem 4.13. Assume α ≥ 1/2. Set β as β= min(1/2, α − 1/2) . 1+ξ With a probability 1 − δ, we have Eϕ (h) ≤ O(n−τ1 + [LD Hinge,∗ 1/2 −τ2 ] n ) 121 . where τ1 = 1 + 2ξ min(1, α) , 2(1 + ξ) τ2 = 1/2 + ξ 2(1 + ξ) ϕ,∗ Proof. Replacing LD in Corollary 4.12 with the expression in (4.6), we have, with a probability 1 − δ, ( Hinge,∗ 1/2 (β−1)/2 Eϕ (h) ≤ C(R, t, a) nβ−α + nβ−1 + nβ−(α+1)/2 + [LD ] n + n−1/2−ξβ We first consider the case when α > 1. In this case, we have ( Hinge,∗ 1/2 (β−1)/2 Eϕ (h) ≤ O nβ−1 + n−1/2−ξβ + [LD ] n ) 1/2 By choosing β − 1 = −1/2 − ξβ, we have β = 1+ξ and Eϕ (h) ≤ O(n−(1/2+ξ)/(1+ξ) + [LD Hinge,∗ 1/2 −(1/2+ξ)/[2(1+ξ)] ] n In the second case, we have α ∈ [1/2, 1]. Hence we have ( ) Hinge,∗ 1/2 (β−1)/2 Eϕ (h) ≤ O nβ−α + [LD ] n + n−1/2−ξβ α−1/2 By setting β − α = −1/2 − ξβ, we have β = 1+ξ and ( Eϕ (h) ≤ O ξα+1/2 Hinge,∗ 1/2 −(1/2+ξ)/[2(1+ξ)] n 1+ξ + [LD ] n − We complete the proof by combining the results for the two cases. 122 ) . ) 4.4.2 Bounding Binary Excess Risk E(h) We now combine the results from Theorem 4.8 and Corollary 4.12 to bound E(h). Theorem 4.14. Assume α ≥ 1/2. For a failure probability δ ∈ (0, 1), define n0 as ( n0 ≤ K3 (B, δ) 1 )1/(2τ −2τ ) 1 2 Hinge,∗ LD where K3 (B, δ) is a constant depending on B and δ, and τ1 and τ2 are defined in Theorem 4.13. Set β as that in Theorem 4.13 if n ≤ n0 and 0, otherwise. Then, with a probability 1 − δ, we have    K4 (B, δ)n−τ1 log n n ≤ n0 Eϕ ≤   K (B, δ)n−1/2 log n n > n 5 0 where K4 (B, δ) and K5 (B, δ) are constants depending on B and δ. Theorem 4.14 follows from Theorem 4.8 and similar analysis for Theorem 4.13, from which we have ) ) ( ( ϕ ϕ,∗ E(h) = LD (h) − L∗D ≤ O min γ −1 , LD (h) − LD log n Remark 4.15. According to Theorem 4.14, when the number of training examples n is not too large, for the binary excess risk of empirical minimizer we have, with a high probability, E(h) ≤ O(n−τ1 log n). In the case when ξ > 0 and α > 1/2 (i.e. 
when the number of optimization iterations is 123 larger than √ ϕ,∗ Hinge,∗ n and LD converges to LD faster than 1/γ), we have τ1 > 1/2, implying that using a smooth convex loss will lead to a generalization error bound better than O(n−1/2 ) when the number of training examples is limited. This implies that for smooth loss function to achieve a binary excess error to the extent which is achievable by corresponding non-smooth loss we can run the first order optimization method for a less number of iterations. This is because our result examines the binary excess risk by taking into account the optimization complexity. We also note 1/(2τ1 − 2τ2 ) is given by 1+ξ 1 = 2τ1 − 2τ2 1/2 + ξ min(1, 2α − 1) Hinge,∗ −2 ] , which could be a large number when When α ≤ 3/4, we have n0 ≥ K3 (B, δ)[LD Hinge,∗ LD is very small. 4.5 Proofs of Statistical Consistency 4.5.1 Proof of Theorem 4.5 We first compute z = arg min z′ 1+η 1−η ϕ(z ′ ; γ) + ϕ(−z ′ ; γ) 2 2 By setting the derivative to be zero, we have 1+η 1−η = 1 + exp(−γ(1 − z)) 1 + exp(−γ(1 + z)) 124 and therefore (1 + η) exp(−γz) − (1 − η) exp(γz) + 2η exp(γ) = 0. Solving the equation, we obtain exp(−γz) = −η exp(γ) + and exp(γz) = η exp(γ) + √ η 2 exp(2γ) + (1 − η 2 ) 1+η √ η 2 exp(2γ) + (1 − η 2 ) . 1−η It is easy to verify that sgn(z) = sgn(η). This is because if η > 0, we have √ √ 1 − η2 1−η exp(−γz) ≤ = <1 1+η 1+η and therefore z > 0. On the other hand, when η < 0, we have exp(γz) = 1+η √ √ ≤ −η exp(γ) + η 2 exp(2γ) + (1 − η 2 ) 1+η < 1, 1−η and therefore z < 0. Using the solution for z, we compute ϕ(η) as 1+η 1−η 1+η 1−η ϕ(z; γ) + ϕ(z; γ) − min ϕ(z; γ) + ϕ(z; γ) z 2 2 2 2 1 + exp(γ(1 − z)) 1 − η 1 + exp(γ(1 + z)) 1+η log − log . = − 2γ 1 + exp(γ) 2γ 1 + exp(γ) ψ(η; γ) = 125 By defining constants C1 = −ηeγ + √ η 2 e2γ + 1 − η 2 and C2 = ηeγ + √ η 2 e2γ + 1 − η 2 , we can rewrite the transform function ψ(η; γ) as 1+η ψ(η; γ) = − log 2γ 4.5.2 ( [ ]) ( [ ]) C1 C2 1 1−η 1 γ γ 1+e − log 1+e . 1 + eγ 1+η 2γ 1 + eγ 1−η Proof of Theorem 4.6 Since the expression for ψ(η; γ) is symmetric in terms η, we will only consider the case when η > 0. First, we have 1−η C1 eγ 1−η √ ≤ = . 1+η 2η η + η 2 + (1 − η 2 )e−2γ Similarly, we have C2 eγ eγ = 1−η 1−η ) ( √ 1 + η 2γ γ ηe + η 2 e2γ + 1 − η 2 ≤ e 1−η Thus, we have ( ) ( ) 1−η 1+η γ 1+η 1+η 1−η γ log(1 + e ) − log − log e ψ(η; γ) ≥ 2γ 2γ 2η 2γ 1−η ( ) ( ) 1+η 1−η 1+η 1−η ≥ η− log − log 2γ 2η 2γ 1−η ( ( ) ) 2 1−η 1+η 1 1 η 1 1 + = η − log + + ≥ η − log γ 4η 2 γ 4η 4 2 where the last inequality follows from the concaveness of log(·) function. As a result when η ∈ (−1, 1) we have η 1 1 1 + + ≤ , 4η 4 2 η 126 which completes the proof. 4.5.3 Proof of Theorem 4.11 Applying Lemmas 4.9 and 4.10 to the solution to the empirical convex risk minimizer h, we have ( ) √ 2 )t 2 )t (B + γB (B + γB ϕ ϕ ϕ LD (h) ≤ LS (h) + K1 + LS (h) (4.7) n n √ ( ) √ 2 2 2 γB (B + γB )t γB 2 (B + γB 2 )t ϕ ∗ ϕ ∗ (B + γB )t ≤ LS (hγ ) + + K1 + LS (hγ ) + n n (k + 2)2 (k + 2)2 n On the other hand, by the application of the Bernstein’s inequality [30], with probability at least 1 − δ we have [( 4B log 1δ ϕ ϕ LS (h∗γ ) − LD (h∗γ ) ≤ + n 4B log 1δ ≤ + n 4E(x,y)∼D √ ] )2 ϕ ϕ(yh∗γ (x); γ) − LD (h∗γ ) log 1δ n (4.8) ϕ 8BLD (h∗γ ) log 1δ . n We conclude the proof by plugging in (7.7) with (7.6), replacing the constants with a new universal constant K, and noting that t = log 1δ + log3 n . 
4.6 Summary

In this chapter we have investigated how the smoothness of the loss function used as the surrogate of the 0-1 loss in empirical risk minimization affects the binary excess risk. While the relation between the convex excess risk and the binary excess risk had previously been established under the weakest possible conditions, such as differentiability, it was not immediately obvious how the smoothness of a convex surrogate translates into statistical consequences. This chapter takes a first step towards understanding this effect. In particular, in contrast to the optimization and generalization analyses that favor smooth surrogate losses, our results revealed that smoothness degrades the binary excess risk. To identify conditions under which smoothness is a desirable property, we proposed a unified analysis that integrates the errors in optimization, generalization, and translating the convex excess risk into the binary excess risk. Our result shows that under favorable conditions and with an appropriate choice of the smoothing parameter, a smoothed hinge loss can achieve a binary excess risk that is better than O(1/√n).

Chapter 5 Regret Bounded by Gradual Variation

The focus so far in this thesis has been on statistical learning, where we assumed that the learner is provided with a pool of i.i.d. training examples drawn from a fixed and unknown distribution D over the instance space Ξ = X × Y and is asked to output a hypothesis h ∈ H that achieves good generalization performance. This statistical assumption permits the estimation of the generalization error, and uniform convergence theory provides basic guarantees on the correctness of future predictions. We turn now to the sequential prediction setting, in which no statistical assumption is made about the sequence of observations. In particular, we consider the online convex optimization problem introduced in Chapter 2, where the ultimate goal is to devise efficient algorithms with sub-linear regret bounds, in terms of the number of rounds the game proceeds, in adversarial environments. We have seen a wide variety of algorithms, such as Follow The Perturbed Leader (FTPL) for linear and combinatorial online learning problems, and the simple Online Gradient Descent (OGD), Follow The Regularized Leader (FTRL), and Online Mirror Descent (OMD) algorithms for general convex functions, which attain O(√T) and O(log T) regret bounds for Lipschitz continuous and strongly convex functions, respectively. Most previous works, including those discussed above, considered the most general setting, in which the loss functions could be arbitrary and possibly chosen in an adversarial way. However, the environments around us may not always be fully adversarial, and the loss functions may have some patterns which can be exploited to achieve a smaller regret. For example, the weather condition or the stock price at one moment may have some correlation with the next, and their difference is usually small, while abrupt changes occur only sporadically. Consequently, one may object that requiring an algorithm to have a small regret for all sequences leads to results that are too loose to be practically interesting, and the bounds obtained for worst-case scenarios become pessimistic for these regular sequences. Recently, it has been shown that the regret of the FTRL algorithm for online linear optimization can be bounded by the total variation of the cost vectors rather than the number of rounds.
This result is appealing for the scenarios where the sequence of loss functions have a pattern and are not fully adversarial. In this chapter we extend this result to general online convex optimization and introduce a new measure referred to as gradual variation to capture the variation of consecutive convex functions. We show that the total variation bound is not necessarily small when the cost functions change slowly, and the gradual variation lower bounds the total variation. To establish the main results, we discuss a lower bound on the performance of the FTRL that maintains only one sequence of solutions, and a necessary condition on smoothness of the cost functions for obtaining a gradual variation bound. We then present two novel algorithms, improved FTRL and Online Mirror Prox (OMP), that bound the regret by the gradual variation of cost functions. Unlike previous approaches that maintain a single sequence of solutions, the proposed algorithms maintain two sequences of solutions that makes it possible to achieve a gradual variation-based regret bound for online convex optimization. We also extend the main results two-fold: (i) we present a general method to obtain a gradual variation bound measured by general norms rather than the ℓ2 norm and specialize it to three online learning settings, namely online linear optimization, prediction with expert advice, and online strictly convex optimization; (ii) we develop a deterministic algorithm for online 130 bandit optimization in multipoint bandit setting based on the proposed OMP algorithm. 5.1 Variational Regret Bounds Recall that in online convex optimization problem, at each trial t, the learner is asked to predict the decision vector wt that belongs to a bounded closed convex set W ⊆ Rd ; it then receives a cost function ft : W → R+ from a family of convex functions F and incurs a cost of ft (wt ) for the submitted solution. The goal of online convex optimization is to come up with a sequence of solutions w1 , . . . , wT ∈ W that minimizes the regret, which is defined as the difference in the cost of the sequence of decisions accumulated up to the trial T made by the learner and the cost of the best fixed decision in hindsight, i.e. RegretT = T ∑ ft (wt ) − min w∈W t=1 T ∑ ft (w). (5.1) t=1 The goal of online convex optimization is to design algorithms that predict, with a small regret, the solution wt at the tth trial given the (partial) knowledge about the past cost functions f1 , f2 , · · · , ft−1 ∈ F . As already mentioned, generally most previous studies of online convex optimization bound the regret in terms of the number of trials T . In particular for general convex Lipschitz √ continuous and strongly convex functions regret bounds of O( T ) and O(log T ) have been established, respectively, which are known to minimax optimal. However, it is expected that the regret should be low in an unchanging environment or when the cost functions are somehow correlated. Ideally, the tightest rate for the regret should depend on the variance of the sequence of cost functions rather than the number of rounds T . Consequently, the 131 bounds obtained for worst case scenarios in terms of number of iterations become pessimistic for these regular sequences and too loose to be practically interesting. Therefore, it is of great interest to derive a variation-based regret bound for online convex optimization in an adversarial setting. 
Recently [69] made a substantial progress in this route and proved a variation-based regret bound for online linear optimization by tight analysis of FTRL algorithm with an appropriately chosen step size. A similar regret bound is shown in the same paper for prediction from expert advice by slightly modifying the multiplicative weights algorithm. In this chapter, we take one step further and contribute to this research direction by developing algorithms for general framework of online convex optimization with variation-based regret bounds. When all the cost functions are linear, i.e., ft (w) = ⟨ft , w⟩, where ft ∈ Rd is the cost vector in trial t, online convex optimization becomes online linear optimization. Many decision problems can be cast into online linear optimization problem, such as prediction from expert advice [37], online shortest path problem [142]. The first variation-based regret bound for online linear optimization problems in an adversarial setting has been shown in [69]. Hazan and Kale’s algorithm for online linear optimization is based on the framework of FTRL. At each trial, the decision vector wt is given by solving the following optimization problem: wt = arg min w∈W t−1 ∑ ⟨fτ , w⟩ + τ =1 1 ∥w∥22 , 2η (5.2) where ft is the cost vector received at trial t after predicting the decision wt , and η is a step 132 size. They bound the regret by the variation of cost vectors defined as VariationT = T ∑ ∥ft − µ∥22 , (5.3) t=1 where µ = 1/T ∑T t=1 ft . By assuming ∥ft ∥2 ≤ 1, ∀t and setting the value of η to η = √ min(2/ VariationT , 1/6), they showed that the regret of FTRL can be bounded by T ∑ t=1 ⟨ft , wt ⟩ − min w∈W T ∑ ⟨ft , w⟩ ≤ t=1  √   15 VariationT   150 √ VariationT ≥ 12 . √ if VariationT ≤ 12 if (5.4) From (5.4), we can see that when the variation of the cost vectors is small (less than 12), the √ regret is a constant, otherwise it is bounded by the variation O( VariationT ). This result indicates that online linear optimization in the adversarial setting is as efficient as in the stationary stochastic setting. 5.2 Gradual Variation and Necessity of Smoothness Here we introduce a new measure to characterize the efficiency of online learning algorithms in evolving environments which is termed as gradual variation. The motivation of defining gradual variation stems from two observations: one is practical and the other one is technical raised by the limitation of extending the results in [69] to general convex functions. From practical point of view, we are interested in a more general scenario, in which the environment may be evolving but in a somewhat gradual way. For example, the weather condition or the stock price at one moment may have some correlation with the next and their difference is usually small, while abrupt changes only occur sporadically. 133 Algorithm 2 Linearalized Follow The Regularized Leader for OCO 1: Input: η > 0 2: Initialize: w1 = 0 3: for t = 1, . . . , T do 4: Predict wt by t−1 ∑ 1 wt = arg min ⟨fτ , w⟩ + ∥w∥22 2η w∈W τ =1 5: Receive cost function ft (·) and incur loss ft (wt ) 6: Compute ft = ∇ft (wt ) 7: end for In order to understand the limitation of extending the results in [69], let us apply the results to general convex loss functions. This is an important problem in its own as online convex optimization generalizes online linear optimization by replacing linear cost functions with non-linear convex cost functions and covers many other sequential decision making problems. 
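A minimal sketch of the FTRL update (5.2) over the unit ball, with the variation-tuned step size of (5.4), is given below. For illustration we assume the total variation (5.3) is known in advance and that the cost vectors are drawn around a fixed direction, so that the sequence is "regular"; both are assumptions made purely to exercise the bound, not part of the original algorithm's requirements.

import numpy as np

rng = np.random.default_rng(2)

def project_unit_ball(w):
    norm = np.linalg.norm(w)
    return w if norm <= 1.0 else w / norm

def ftrl_regret(costs, eta):
    """FTRL with l2 regularization on the unit ball, as in (5.2); returns the regret (5.1)."""
    d = costs.shape[1]
    cum = np.zeros(d)
    loss = 0.0
    for f in costs:
        w = project_unit_ball(-eta * cum)   # closed form of the FTRL update on the unit ball
        loss += f @ w
        cum += f
    best = -np.linalg.norm(costs.sum(axis=0))  # loss of the best fixed point in hindsight
    return loss - best

# A "regular" sequence: cost vectors drawn around a fixed direction, so the total variation is small.
T, d = 2000, 10
mu = rng.normal(size=d)
mu /= np.linalg.norm(mu)
costs = mu + 0.05 * rng.normal(size=(T, d))
costs /= np.maximum(1.0, np.linalg.norm(costs, axis=1, keepdims=True))  # enforce ||f_t||_2 <= 1

variation = ((costs - costs.mean(axis=0)) ** 2).sum()   # total variation (5.3)
eta = min(2.0 / np.sqrt(variation), 1.0 / 6.0)          # step size suggested by (5.4), assuming
                                                        # VariationT is known in advance
print("total variation:", round(variation, 2))
print("FTRL regret    :", round(ftrl_regret(costs, eta), 2))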
For instance, it has found applications in portfolio management [6] and online classification [88]. In online portfolio management problem, an investor wants to distribute his wealth over a set of stocks without knowing the market output in advance. If we let wt denote the distribution on the stocks and rt denote the price relative vector, i.e., rt [i] denote the the ratio of the closing price of stock i on day t to the closing price on day t − 1, then an interesting function is the logarithmic growth ratio, i.e. ∑T t=1 log(⟨wt , rt ⟩), which is a concave function to be maximized. Since the results in [69] were developed for linear loss functions, a straightforward approach is to use the first order approximation for convex loss functions, i.e., ft (w) ≃ ft (wt ) + ⟨∇ft (wt ), w − wt ⟩, and replace the linear loss vector with the gradient of the loss function ft (w) at wt . The resulting algorithm is shown in Algorithm 2. Using the 134 convexity of loss function ft (w), we have T ∑ ft (wt ) − min w∈W t=1 T ∑ T T ∑ ∑ ft (w) ≤ ⟨ft , wt ⟩ − min ⟨ft , w⟩. t=1 w∈W t=1 (5.5) t=1 If we assume ∥∇ft (w)∥2 ≤ 1, ∀t ∈ [T ], ∀w ∈ W, we can apply Hazan and Kale’s variationbased bound in (5.4) to bound the regret in (5.5) by the variation of the cost functions as: VariationT = T ∑ ∥ft − µ∥22 = t=1 T ∑ t=1 T 1 ∑ ∇ft (wt ) − ∇fτ (wτ ) T τ =1 2 . (5.6) 2 To better understand VariationT in (5.6), we rewrite it as VariationT = T ∑ t=1 = 1 2T T 1 ∑ ∇fτ (wτ ) ∇ft (wt ) − T τ =1 T ∑ 2 2 ∥∇ft (wt ) − ∇fτ (wτ )∥2 t,τ =1 T T T T 1 ∑∑ 1 ∑∑ 2 ≤ ∥∇ft (wt ) − ∇ft (wτ )∥2 + ∥∇ft (wτ ) − ∇fτ (wτ )∥22 T T t=1 τ =1 t=1 τ =1 = Variation1T + Variation2T . We see that the variation VariationT is bounded by two parts: Variation1T essentially measures the smoothness of individual cost functions, while Variation2T measures the variation in the gradients of cost functions. Let us consider an easy setting when all cost functions are 135 identical. In this case, Variation2T vanishes, and VariationT is equal to Variation1T /2, i.e., VariationT = T 1 ∑ ∥∇ft (wt ) − ∇fτ (wτ )∥2 2T t,τ =1 T 1 ∑ = ∥∇ft (wt ) − ∇ft (wτ )∥2 2T t,τ =1 = Variation1T . 2 As a result, the regret of the FTRL algorithm for online convex optimization may still be √ bounded by O( T ) regardless of the smoothness of the cost functions. To address this challenge, we develop two novel algorithms for online convex optimization that bound the regret by the variation of cost functions. In particular, we would like to bound the regret of online convex optimization by the variation of cost functions defined as follows: GradualVariationT = T∑ −1 t=1 max ∥∇ft+1 (w) − ∇ft (w)∥22 . w∈W (5.7) Note that the variation in (5.7) is defined in terms of gradual difference between individual cost function to its previous one, while the variation in (5.3) is defined in terms of total difference between individual cost vectors to their mean. Therefore we refer to the variation defined in (5.7) as gradual variation, and to the variation defined in (5.3) as total variation. It is straightforward to show that when ft (w) = ⟨ft , w⟩, the gradual variation GradualVariationT is upper bounded by the total variation VariationT defined with a constant factor: T∑ −1 t=1 ∥ft+1 − ft ∥22 ≤ T∑ −1 2∥ft+1 − µ∥22 + 2∥ft − µ∥22 ≤ 4 t=1 T ∑ t=1 136 ∥ft − µ∥22 . On the other hand, we can not bound the total variation by the gradual variation up to a constant. This is verified by the following example. 
Let us assume that the adversary plays a fixed function f for the first half of the iterations and another different function g for the second half of the iterations, i.e., f1 = · · · = fT /2 = f and fT /2+1 = · · · = fT = g ̸= f . Then, in this simple scenario the total variation of the sequence of cost functions in (5.3) is given by VariationT = T ∑ t=1 T ∥ft − µ∥22 = 2 f +g 2 T f +g 2 f− + g− = Ω(T ), 2 2 2 2 2 while the gradual variation defined in (5.7) is a constant given by GradualVariationT = T∑ −1 ∥ft+1 − ft ∥22 = ∥f − g∥22 = O(1). t=1 Based on the above analysis, we claim that the regret bound by gradual variation is usually tighter than total variation. In particular, the following theorem shows a lower bound on the performance of the FTRL in terms of gradual variation. Unlike the standard setting of online learning where the FTLR achieves the optimal regret bound for Lipschitz continuous and strongly convex losses, it is not capable of achieving regret bounded by gradual variation. The result of this theorem motivates us to develop new algorithms for √ online convex optimization to achieve a gradual variation bound of O( GradualVariationT ). For the ease of exposition we use GVT to denote the gradual variation after T iterations. Theorem 5.1. The regret of FTRL is at least Ω(min(GVT , √ T )). Proof. Let f be any unit vector passing through w1 . Let s = ⌊1/η⌋, so that if we use ft = f for every t ≤ s, each such zt+1 = w1 − tηf still remains in W and thus wt+1 = zt+1 . Next, we analyze the regret by considering the following three cases depending on the range of s. 137 √ T. √ First, when s ≥ T , we choose ft = f for t from 1 to ⌊s/2⌋ and ft = 0 for the remaining Case I: s ≥ t. Clearly, the best strategy of the offline algorithm is to play w = −f . On the other hand, since the learning rate η is too small, the strategy wt played by GD, for t ≤ ⌊s/2⌋, is far away from w, so that ⟨ft , wt − w⟩ ≥ 1 − tη ≥ 1/2. Therefore, the regret is at least √ ⌊s/2⌋ (1/2) = Ω( T ). √ Case II: 0 < s < T . √ Second, when 0 < s < T , the learning rate is high enough so that FTRL may overreact to each loss vector, and we make it pay by flipping the direction of loss vectors frequently. More precisely, we use the vector f for the first s rounds so that wt+1 = w1 − tηf for any t ≤ s, but just as ws+1 moves far enough in the direction of −f , we make it pay by switching the loss vector to −f , which we continue to use for s rounds. Note that ws+1+r = ws+1−r but ∑ fs+1+r = −fs+1−r for any r ≤ s, so 2s t=1 ⟨ft , wt − w1 ⟩ = ⟨fs+1 , ws+1 − w1 ⟩ ≥ Ω(1). As w2s+1 returns back to w1 , we can see the first 2s rounds as a period, which only contributes ∥2f ∥22 = 4 to the deviation. Then we repeat the period for τ times, where τ = ⌊GVT /4⌋ if there are enough rounds, with ⌊T /(2s)⌋ ≥ ⌊GVT /4⌋, to use up the gradual variation GVT , and τ = ⌊T /(2s)⌋ otherwise. For any remaining round t, we simply choose ft = 0. As a √ result, the total regret is at least Ω(1) · τ = Ω(min{GVT /4, T /(2s)}) = Ω(min{GVT , T }). Case III: s = 0. Finally, when s = 0, the learning rate is so high that we can easily make GD pay by flipping the direction of the loss vector in each round. More precisely, by starting with f1 = −f , we can have w2 on the boundary of W, which means that if we then alternate between f and −f , the strategies FTRL plays will alternate between w3 and w2 which have a constant distance from each other. 
Then following the analysis in the second case, one can show that 138 the total regret is at least Ω(min{GVT , T }). Assumption 5.2. In this study, we assume smooth cost functions with Lipschitz continuous gradients, i.e., there exists a constant L > 0 such that ∥∇ft (w) − ∇ft (w′ )∥2 ≤ L∥w − w′ ∥2 , ∀w, w′ ∈ W, ∀t. (5.8) We would like to emphasize that our assumption about the smoothness of cost functions is necessary to achieve the variation-based bound stated in this chapter. To see this, consider the special case of f1 (w) = · · · = fT (w) = f (w). If we are able to achieve a regret bound which scales as the square roof of the gradual variation, for any sequence of convex functions, then for the special case where all the cost functions are identical, we have T ∑ t=1 implying that wT = f (wt ) ≤ min w∈W T ∑ f (w) + O(1), t=1 ∑T t=1 wt /T approaches the optimal solution at the rate of O(1/T ). √ This contradicts the lower complexity bound (i.e., O(1/ T )) for any optimization method which only uses first order information about the cost functions [119, Theorem 3.2.1] (see also Table 2.1). This analysis indicates that the smoothness assumption is necessary to attain variation based regret bound for general online convex optimization problem. We would like to emphasize the fact that this contradiction holds when only the gradient information about the cost functions is provided to the learner and the learner may be able to achieve a variation-based bound using second order information about the cost functions, which is not the focus of this chapter. 139 −ηft wt+1 = wt − ηft ˆ t = wt − ηft−1 w −ηft−1 wt = wt−1 − ηft−1 W −ηft−1 . Figure 5.1: Illustration of the main idea behind the proposed improved FTRL and online mirror prox methods to attain regret bounds in terms of gradual variation for linear loss ˆ t instead of wt to suffer less regret when the functions. The learner plays the decision w consecutive loss functions are gradually evolving. 5.3 The Improved FTRL Algorithm As mentioned earlier, the ultimate goal of this chapter is to have algorithms that can take advantage of benign sequences in gradually evolving environments and at the same time protect against the adversarial sequences. However, the impossibility result we showed in the previous section and in particular Theorem 5.1, demonstrated that the existing algorithms such as OGD and in general the family of follow the regularized leader algorithms fail to attain a regret bounded by the gradual variation of the loss functions. Motivated by this negative result, we now turn to proposing two algorithms for online convex optimization 140 that are able to attain regret bounds in terms of gradual variation. The first algorithm is an improved FTRL and the second one is based on the mirror prox method introduced in Chapter 2. One common feature shared by the two algorithms is that both of them maintain two sequences of solutions: decision vectors w1:T = (w1 , · · · , wT ) and searching vectors z1:T = (z1 , · · · , zT ) that facilitate the updates of decision vectors. Both algorithms share almost the same regret bound except for a constant factor. All of our algorithms are based on the following idea, which we illustrate using the online linear optimization problem as an example which is graphically depicted in Figure 5.1. For general linear functions, the online gradient descent algorithm is known to achieve an optimal regret, which plays in round t the point wt = ΠW (wt−1 − ηft−1 ). 
Now, if the loss functions have a small deviation, ft−1 may be close to ft , so in round t, it may be a good idea to play a point which moves further in the direction of −ft−1 as it may make its inner product with ft (which is its loss with respect to ft ) smaller. In fact, it can be shown that if one could play the point wt+1 = ΠW (wt − ηft ) in round t, a very small regret could be achieved, i.e., ∑T t=1 ⟨wt+1 , ft ⟩ − minw∈W ∑T t=1 ⟨w, ft ⟩ ≤ O(1), but in reality one does not have ft available before round t to compute wt+1 . On the other hand, if ft−1 is a good estimate of ˆ t = ΠW (wt − ηft−1 ) should be a good estimate of wt+1 too. The point w ˆt ft , the point w ˆt can actually be computed before round t since ft−1 is available, so our algorithm plays w in round t. As it will be clear later in this chapter, our algorithms for the prediction with expert advice problem and the online convex optimization problem use the same idea to be able to achieve regret bounds stated in terms of gradual variation of the sequence of losses. To facilitate the discussion, besides the variation of cost functions defined in (5.7), we 141 Algorithm 3 Improved FTRL (IFTRL) Algorithm 1: Input: η ∈ (0, 1] 2: Initialize:: z0 = 0 and f0 (w) = 0 3: for t = 1, . . . , T do 4: Predict wt by { } L 2 wt = arg min ⟨w, ∇ft−1 (zt−1 ⟩) + ∥w − zt−1 ∥2 2η w∈W 5: 6: Receive cost function ft (·) and incur loss ft (wt ) Update zt by { } t ∑ L zt = arg min ⟨w, ∇fτ (zτ −1 )⟩ + ∥w∥22 2η w∈W τ =1 7: end for define another variation, named extended gradual variation, as follows EGVT,2 (y1:T ) = T∑ −1 ∥∇ft+1 (yt ) − ∇ft (yt )∥22 ≤ ∥∇f1 (y0 )∥22 + GVT , (5.9) t=0 where f0 (w) = 0, the sequence (y0 , . . . , yT ) is either (z0 , . . . , zT ) (as in the improved FTRL) or (w0 , . . . , wT ) (as in the online mirror prox method) and the subscript 2 means the variation is defined with respect to ℓ2 norm. When all cost functions are identical, GVT becomes zero and the extended variation EGVT,2 (y1:T ) is reduced to ∥∇f1 (y0 )∥22 , a constant independent from the number of trials. In the sequel, we use the notation EGVT,2 for simplicity. Our results show that for online convex optimization with L-smooth cost functions, the regrets of the proposed algorithms can be bounded as follows T ∑ t=1 ft (wt ) − min w∈W T ∑ ft (w) ≤ O (√ ) EGVT,2 + constant. (5.10) t=1 We now turn to presenting our first algorithm, dubbed IFTRL, which is a simple modification of the FTRL algorithm and show that its regret bounded by the gradual variation. 142 The improved FTRL algorithm for online convex optimization is presented in Algorithm 3. Without loss of generality, we assume that the decision set W is contained in a unit ball B = {x ∈ Rd : ∥x∥ ≤ 1}, i.e., W ⊆ B, and 0 ∈ W. Note that in step 6, the searching vectors zt are updated according to the FTRL algorithm after receiving the cost function ft (w). To understand the updating procedure for the decision vector wt specified in step 4, we rewrite it as { } L 2 wt = arg min ft−1 (zt−1 ) + ⟨w − zt−1 , ∇ft−1 (zt−1 )⟩ + ∥w − zt−1 ∥2 . 2η w∈W (5.11) Notice that L ∥w − zt−1 ∥22 2 L ≤ ft (zt−1 ) + ⟨w − zt−1 , ∇ft (zt−1 )⟩ + ∥w − zt−1 ∥22 , 2η ft (w) ≤ ft (zt−1 ) + ⟨w − zt−1 , ∇ft (zt−1 )⟩ + where the first inequality follows the smoothness condition in (5.8) and the second inequality follows from the fact η ≤ 1. The inequality (5.12) provides an upper bound for ft (w) and therefore can be used as an approximation of ft (w) for predicting wt . 
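For concreteness, a minimal sketch of the two updates of Algorithm 3 is given below, taking W to be the unit ball so that both prox steps reduce to projected gradient-style steps; the `project_unit_ball` helper, the gradient-oracle interface, and treating η as a plain input (rather than setting it as in Theorem 5.3) are our own simplifications. The paragraph that follows explains why the surrogate gradient ∇ft−1(zt−1) appears in the prediction step.

```python
import numpy as np

def project_unit_ball(w):
    """Euclidean projection onto the unit ball B; the text assumes W is contained in B."""
    n = np.linalg.norm(w)
    return w if n <= 1.0 else w / n

def iftrl(grad_oracles, d, eta, L):
    """Sketch of Algorithm 3 (IFTRL) with W taken to be the unit ball.

    grad_oracles is a list of callables, one per round; the t-th callable returns
    the gradient of the round-t cost function at a given point.
    """
    z = np.zeros(d)          # z_0 = 0
    grad_prev = np.zeros(d)  # gradient of f_0 = 0, used to predict w_1 = z_0
    grad_sum = np.zeros(d)   # running sum of grad f_tau(z_{tau-1}) for the FTRL step
    decisions = []
    for grad in grad_oracles:
        # Step 4: play w_t, a prox step from z_{t-1} along the previous round's gradient
        w = project_unit_ball(z - (eta / L) * grad_prev)
        decisions.append(w)
        # Steps 5-6: observe f_t, evaluate grad f_t(z_{t-1}), then the FTRL update for z_t
        grad_sum += grad(z)
        z = project_unit_ball(-(eta / L) * grad_sum)
        # grad f_t(z_t) is what the next round's prediction step will use
        grad_prev = grad(z)
    return decisions
```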
However, since ∇ft (zt−1 ) is unknown before the prediction, we use ∇ft−1 (zt−1 ) as a surrogate for ∇ft (zt−1 ), leading to the updating rule in (5.11). It is this approximation that leads to the variation bound. The following theorem states the regret bound of Algorithm 3. Theorem 5.3. Let f1 , f2 , . . . , fT be a sequence of convex functions with L-Lipschitz contin{ } √ uous gradients. By setting η = min 1, L/ EGVT,2 , we have the following regret bound 143 for the IFTRL in Algorithm 3: T ∑ ft (wt ) − min t=1 w∈W T ∑ ) ( √ ft (w) ≤ max L, EGVT,2 . t=1 Remark 5.4. Comparing with the variation bound in (5.7) for the FTRL algorithm, the smoothness parameter L plays the same role as Variation1T that accounts for the smoothness of cost functions, and term EGVT,2 plays the same role as Variation2T that accounts for the variation in the cost functions. Compared to the FTRL algorithm, the key advantage of the improved FTRL algorithm is that the regret bound is reduced to a constant when the cost functions change only by a constant number of times along the horizon. Of course, the extended variation EGVT,2 may not be known apriori for setting the optimal η, we can apply the standard doubling trick [38] to obtain a bound that holds uniformly over time and is a factor at most 8 from the bound obtained with the optimal choice of η. The details are provided later in this chapter. 5.4 The Online Mirror Prox Algorithm The second algorithm we present to attain regret bounds in terms of gradual variation is based on the prox method we introduced in Chapter 2 for non-smooth convex optimization. We generalize the prox method for online convex optimization that shares the same order of regret bound as the improved FTRL algorithm. The detailed steps of the Online Mirror Prox (OMP) method are shown in Algorithm 4, where we use an equivalent form of updates for wt and zt in order to compare to Algorithm 3 . The OMP method is closely related to the prox method in [117] by maintaining two sets of vectors w1:T and z1:T , where wt and 144 Algorithm 4 Online Mirror Prox (OMP) Algorithm 1: Input: η > 0 2: Initialize:: z0 = w0 = 0 and f0 (w) = 0 3: for t = 1, . . . , T do 4: Predict wt by { } L 2 wt = arg min ⟨w, ∇ft−1 (wt−1 )⟩ + ∥w − zt−1 ∥2 2η w∈W 5: 6: Receive cost function ft (·) and incur loss ft (wt ) Update zt by } { L 2 zt = arg min ⟨w, ∇ft (wt )⟩ + ∥w − zt−1 ∥2 2η w∈W 7: end for zt are computed by gradient mappings using ∇ft−1 (wt−1 ), and ∇ft (wt ), respectively, as ( 1 w − zt−1 − w∈W 2 ( 1 zt = arg min w − zt−1 − w∈W 2 wt = arg min ) 2 η ∇ft−1 (wt−1 ) L 2 ) 2 η ∇ft (wt ) L 2 The OMP differs from the IFTRL algorithm: (i) in updating the searching points zt , Algorithm 3 updates zt by the FTRL scheme using all the gradients of the cost functions at {zτ }t−1 τ =1 , while OMP updates zt by a prox method using a single gradient ∇ft (wt ), and (ii) in updating the decision vector wt , OMP uses the gradient ∇ft−1 (wt−1 ) instead of ∇ft−1 (zt−1 ). The advantage of OMP algorithm compared to the IFTRL algorithm is that it only requires to compute one gradient ∇ft (wt ) for each loss function; in contrast, the improved FTRL algorithm in Algorithm 3 needs to compute the gradients of ft (w) at two searching points zt and zt−1 . It is these differences that make it easier to extend the OMP to a bandit setting, which will be discussed in Section 5.7. The following theorem states the regret bound of the online mirror prox method for online convex optimization. 
145 Algorithm 5 Online Mirror Prox Method for General Norms 1: Input: η > 0, Φ(z) 2: Initialize:: z0 = w0 = minz∈W Φ(z) and f0 (w) = 0 3: for t = 1, . . . , T do 4: Predict wt by { } L wt = arg min ⟨w, ∇ft−1 (wt−1 )⟩ + B(w, zt−1 ) η w∈W 5: 6: Receive cost function ft (·) and incur loss ft (wt ) Update zt by { } L zt = arg min ⟨w, ∇ft (wt )⟩ + B(w, zt−1 ) η w∈W 7: end for Theorem 5.5. Let f1 , f2 , . . . , fT be a sequence of convex functions with L-Lipschitz contin{ √ } √ uous gradients. By setting η = (1/2) min 1/ 2, L/ EGVT,2 , we have the following regret bound for OMP in Algorithm 4 T ∑ ft (wt ) − min t=1 w∈W T ∑ ft (w) ≤ 2 max (√ ) √ 2L, EGVT,2 . t=1 We note that compared to Theorem 5.3, the regret bound in Theorem 5.5 is slightly worse by a factor of 2. 5.4.1 Online Mirror Prox Method with General Norms In this subsection, we first present a general OMP method to obtain a variation bound defined in a general norm. Then we discuss three special cases: online linear optimization, prediction with expert advice, and online strictly convex optimization. The omitted proofs in this subsection can be easily duplicated by mimicking the proof of Theorem 5.5, if necessary with the help of previous analysis as mentioned in the appropriate text. To adapt OMP to general norms other than the Euclidean norm, let ∥ · ∥ denote a general 146 norm, ∥ · ∥∗ denote its dual norm, Φ(z) be a α-strongly convex function with respect to the ( ) norm ∥ · ∥, and B(w, z) = Φ(w) − Φ(z) + ⟨w − z, Φ′ (z)⟩ be the Bregman distance induced by function Φ(w). Let f1 , f2 , · · · , fT be a sequence of smooth functions with Lipschitz continuous gradients bounded by L with respect to norm ∥ · ∥, i.e., ∥∇ft (w) − ∇ft (w′ )∥∗ ≤ L∥w − w′ ∥. (5.12) Correspondingly, we define the extended gradual variation based on the general norm as follows: EGVT = T∑ −1 ∥∇ft+1 (wt ) − ∇ft (wt )∥2∗ . (5.13) t=0 Algorithm 5 gives the detailed steps for the general framework. We note that the key differences from Algorithm 4 are: z0 is set to minz∈W Φ(z), and the Euclidean distances in steps 4 and 6 are replaced by Bregman distances, i.e., { } L wt = arg min ⟨w, ∇ft−1 (wt−1 )⟩ + B(w, zt−1 ) , η w∈W { } L zt = arg min ⟨w, ∇ft (wt )⟩ + B(w, zt−1 ) . η w∈W (5.14) The following theorem states the variation-based regret bound for the general norm framework, where R measure the size of W defined as √ R= 2( max Φ(w) − min Φ(w)). w∈W w∈W Theorem 5.6. Let f1 , f2 , · · · , fT be a sequence of convex functions whose gradients are Lsmooth continuous, Φ(z) be a α-strongly convex function, both with respect to norm ∥ · ∥, and {√ √ } √ EGVT be defined in (5.13). By setting η = (1/2) min α/ 2, LR/ EGVT , we have the 147 following regret bound T ∑ ft (wt ) − min T ∑ w∈W t=1 ft (w) ≤ 2R max (√ ) √ √ 2LR/ α, EGVT . t=1 In the following subsections we specialize the proposed general method to few specific online learning settings. 5.4.2 Online Linear Optimization Here we consider online linear optimization and present the algorithm and the gradual variation bound for this setting as a special case of proposed algorithm. In particular, we are interested in bounding the regret by the gradual variation f EGVT,2 = T∑ −1 ∥ft+1 − ft ∥22 , t=0 where ft , t = 1, . . . , T are the linear cost vectors and f0 = 0. Since linear functions are smooth functions that satisfy the inequality in (5.8) for any positive L > 0, therefore we can apply Algorithm 4 to online linear optimization with any positive value for L 1 . 
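As a concrete special case, the sketch below instantiates Algorithm 4 for linear costs with L = 1 and W equal to the unit ball; these simplifications, the helper name, and treating η as a plain input rather than setting it from the gradual variation as in the corollary below, are ours.

```python
import numpy as np

def project_unit_ball(w):
    n = np.linalg.norm(w)
    return w if n <= 1.0 else w / n

def online_mirror_prox_linear(loss_vectors, eta):
    """Sketch of Algorithm 4 for linear costs f_t(w) = <f_t, w>, with L = 1 and W the unit ball."""
    d = len(loss_vectors[0])
    z = np.zeros(d)
    f_prev = np.zeros(d)          # f_0 = 0
    incurred = []
    for f_t in loss_vectors:
        # Step 4: predict w_t by a prox step from z_{t-1} along the *previous* loss vector
        w = project_unit_ball(z - eta * f_prev)
        incurred.append(float(f_t @ w))
        # Step 6: update the search point z_t using the *current* loss vector
        z = project_unit_ball(z - eta * f_t)
        f_prev = f_t
    return incurred
```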
The regret bound of Algorithm 4 for online linear optimization is presented in the following corollary. Corollary 5.7. Let ft (w) = ⟨ft , w⟩, t = 1, . . . , T be a sequence of linear functions. By √ ( ) setting η = f 1/ 2EGVT,2 and L = 1 in Algorithm 4, then we have T ∑ t=1 ft⊤ wt − min w∈W T ∑ √ ⟨ft , w⟩ ≤ f 2EGVT,2 . t=1 1 We simply set L = 1 for online linear optimization and prediction with expert advice. 148 Remark 5.8. Note that the regret bound in Corollary 5.7 is stronger than the regret bound obtained in [71] for online linear optimization due to the fact that the gradual variation is smaller than the total variation. 5.4.3 Prediction with Expert Advice In the problem of prediction with expert advice, the decision vector w is a distribution over m experts, i.e., w ∈ W = {w ∈ Rm + : ∑m m i=1 wi = 1}. Let ft ∈ R denote the costs for m experts in trial t. Similar to [69], we would like to bound the regret of prediction from expert advice by the gradual variation defined in infinite norm, i.e., f EGVT,∞ = T∑ −1 ∥ft+1 − ft ∥2∞ . t=0 Since it is a special online linear optimization problem, we can apply Algorithm 4 to obtain a regret bound as in Corollary 5.7, i.e., T ∑ ft⊤ wt − min w∈W t=1 T ∑ √ √ ⟨ft , w⟩ ≤ f 2EGVT,2 ≤ f 2mEGVT,∞ . t=1 However, the above regret bound scales badly with the number of experts. We can obtain √ f a better regret bound in O( EGVT,∞ ln m) by applying the general prox method in Algorithm 5 with Φ(w) = ∑m i=1 wi ln wi and B(w, z) = ∑m i=1 wi ln(zi /wi ). The two updates in Algorithm 5 become i exp([η/L]f i ) zt−1 t−1 i wt = ∑ , i = 1, . . . , m j j m j=1 zt−1 exp([η/L]ft−1 ) i exp([η/L]f i ) zt−1 t i zt = ∑ , i = 1, . . . , m. j j m j=1 zt−1 exp([η/L]ft ) 149 (5.15) The resulting regret bound is formally stated in the following Corollary. Corollary 5.9. Let ft (w) = ⟨ft , w⟩, t = 1, . . . , T be a sequence of linear functions in pre√ ∑ f diction with expert advice. By setting η = (ln m)/EGVT,∞ , L = 1, Φ(w) = m i=1 wi ln wi and B(w, z) = ∑m i=1 wi ln(wi /zi ) in Algorithm 5, we have √ T T ∑ ∑ f ⟨ft , wt ⟩ − min ⟨ft , w⟩ ≤ 2EGVT,∞ ln m. w∈W t=1 t=1 By noting the definition of EGVfT,∞ , the regret bound in Corollary 5.9 is  O T∑ −1 t=0  i − f i | ln m , max |ft+1 t i which is similar to the regret bound obtained in [69] for prediction with expert advice. However, the definitions of the variation are not exactly the same. In [69], the authors bound the (√ ) ∑T i i 2 regret of prediction with expert advice by O ln m maxi t=1 |ft − µt | + ln m , where the variation is the maximum total variation over all experts. To compare the two regret bounds, we first consider two extreme cases. When the costs of all experts are the same, then the variation in Corollary 5.9 is a standard gradual variation, while the variation in [69] is a standard total variation. According to the previous analysis, a gradual variation is smaller than a total variation, therefore the regret bound in Corollary 5.9 is better than that in [69]. In another extreme case when the costs at all iterations of each expert are the same, both regret bounds are constants. More generally, if we assume the maximum total variation is small (say a constant), then By a trivial analysis ∑T −1 i i t=0 |ft+1 − ft | is also a constant for any i ∈ [m]. ∑T −1 ∑T −1 i i i i t=0 maxi |ft+1 − ft | ≤ m maxi t=0 |ft+1 − ft |, the regret bound in Corollary 5.9 might be worse up to a factor √ 150 m than that in [69]. Remark 5.10. 
It was shown in [41], both the regret bounds in Corollary 5.7 and Corollary 5.9 are optimal because they match the lower bounds for a special sequence of loss functions. In particular, for online linear optimization if all loss functions but the first √ √ f Tk = EGVT,2 are all-0 functions, then the known lower bound Ω( Tk ) matches the upper bound in Corollary 5.7. Similarly, for prediction from expert advice if all loss functions but √ √ f ′ the first Tk = EGVT,∞ are all-0 functions, then the known lower bound Ω( Tk′ ln m) [38] matches the upper bound in Corollary 5.9. 5.4.4 Online Strictly Convex Optimization In this subsection, we present an algorithm to achieve a logarithmic variation bound for online strictly convex optimization. In particular, we assume the cost functions ft (w) are not only smooth but also strictly convex defined formally in the following. Definition 5.11. For β > 0, a function f (w) : W → R is β-strictly convex if for any w, z ∈ W f (w) ≥ f (z) + ∇⟨f (z), w − z⟩ + β(w − z)⊤ ∇f (z)∇f (z)⊤ (w − z) (5.16) It is known that such a defined strictly convex function include strongly convex function and exponential concave function as special cases as long as the gradient of the function is bounded. To see this, if f (w) is a β ′ -strongly convex function with a bounded gradient ∥∇f (w)∥2 ≤ G, then f (w) ≥ f (z) + ⟨∇f (z), w − z⟩ + β ′ ⟨w − z, w − z⟩ β′ ≥ f (z) + ⟨∇f (z), w − z⟩ + 2 (w − z)⊤ ∇f (z)∇f (z)⊤ (w − z), G 151 (5.17) thus f (w) is a (β ′ /G2 ) strictly convex. Similarly if f (w) is exp-concave, i.e., there exists α > 0 such that h(w) = exp(−αf (w)) is concave, then f (w) is a β = 1/2 min(1/(4GD), α) strictly convex (c.f. Lemma 2 in [68]), where D is defined as the diameter of the domain. Therefore, in addition to smoothness and strict convexity we also assume all the cost functions have bounded gradients, i.e., ∥∇ft (w)∥2 ≤ G. We now turn to deriving a logarithmic gradual variation bound for online strictly convex optimization. To this end, we need to change the Euclidean distance function in Algorithm 4 to a generalized Euclidean distance function. Specifically, at trial t, we let Ht = I + βG2 I + β ∑t−1 1 ⊤ τ =0 ∇fτ (wτ )∇fτ (wτ ) and use the generalized Euclidean distance Bt (w, z) = 2 ∥w − z∥2H = 21 (w − z)⊤ Ht (w − z) in updating wt and zt , i.e., t { } 1 2 wt = arg min ⟨w, ∇ft−1 (wt−1 )⟩ + ∥w − zt−1 ∥H t 2 w∈W { } 1 zt = arg min ⟨w, ∇ft (wt )⟩ + ∥w − zt−1 ∥2H , t 2 w∈W (5.18) To prove the regret bound, we can prove a similar inequality as in Lemma 5.16 by applying Φ(w) = 1/2∥w∥2H , which is stated as follows t ∇ft (wt )⊤ (wt − z) ≤ Bt (z, zt−1 ) − Bt (z, zt ) ] 1[ 2 2 2 ∥wt − zt−1 ∥H + ∥wt − zt ∥H . + ∥∇ft (wt ) − ∇ft−1 (wt−1 )∥ −1 − t t Ht 2 Then by applying inequality in (5.16) for strictly convex functions, we obtain the following 152 ft (wt ) − ft (z) ≤ Bt (w, zt−1 ) − Bt (w, zt ) − β∥wt − z∥2M t ] [ 1 + ∥∇ft (wt ) − ∇ft−1 (wt−1 )∥2 −1 − ∥wt − zt−1 ∥2H + ∥wt − zt ∥2H , t t Ht 2 where Mt = ∇ft (wt )∇ft (wt )⊤ and Ht = I + βG2 I + β (5.19) ∑t−1 τ =0 Mτ as defined above. The following corollary shows that general OMP method attains a logarithmic gradual variation bound and its proof is deferred to later. Corollary 5.12. Let f1 , f2 , . . . , fT be a sequence of β-strictly convex and L-smooth functions √ with gradients bounded by G. We assume 8dL2 ≥ 1, otherwise we can set L = 1/(8d). 
An algorithm that adopts the updates in (5.18) has a regret bounded by T ∑ ft (wt ) − min w∈W t=1 where EGVT,2 = 5.4.5 T ∑ t=1 ft (w) ≤ 1 + βG2 8d + ln max(16dL2 , βEGVT,2 ), 2 β ∑T −1 2 t=0 ∥∇ft+1 (wt ) − ∇ft (wt )∥2 and d is the dimension of w ∈ W. Gradual Variation Bounds which Hold Uniformly over Time As mentioned in Remark 5.4, the algorithms presented in this chapter rely on the previous knowledge of the gradual variation EGVT,2 to tune the learning rate η to obtain the optimal bound. Here, we show that the Algorithm 3 can be used as a black-box to achieve the same regret bound but without any prior knowledge of the EGVT,2 . We note that the analysis here is not specific to Algorithm 3 and it is general enough to be adapted to other algorithms in the chapter too. The main idea is to run the algorithm in epochs with a fixed learning rate ηk = η0 /2k for 153 kth epoch where η0 is a fixed constant and will be decided by analysis. We denote the number of epochs by K and let bk denote the start of kth epoch. We note that bK+1 = T +1. Within kth epoch, the algorithm ensures that the inequality ηk ∑bk+1 −1 t=bk ∥∇ft+1 (zt ) − ∇ft (zt )∥22 ≤ L2 ηk−1 holds. To this end, the algorithm computes and maintains the quantity t ∑ ∥∇fs+1 (zs ) − ∇fs (zs )∥22 s=bk and sets the beginning of new epoch to be bk+1 = min t t ∑ ∥∇fs+1 (zs ) − ∇fs (zs )∥22 > L2 ηk−1 , s=bk i.e., the first iteration for which the invariant is violated. We note that this decision can only be made after seeing the tth cost function. Therefor, we burn the first iteration of each epoch which causes an extra regret of KL2 in the total regret. From the analysis we have: T ∑ t=1 ft (wt ) − min w∈W T ∑ ft (w) ≤ t=1 K ∑ k=1 ≤ ≤ b  −1 k+1 ∑ t=bk ft (wt ) − min bk+1 −1 ∑ w∈W  ft (w) t=bk K ∑ η L + k EGVb :b + KL2 k k+1 −1 2ηk 2L k=1 K ∑ k=1 ∑ L L + + K = 2L ηk−1 + KL2 2ηk 2ηk K k=1 where the first inequality follows the analysis of algorithm for each epoch, the last inequality follows the invariant maintained within each phase and the constant KL2 is due to burning the first iteration of each epoch. We now try to upper bound the last term. We first note that ∑K−1 −1 k −1 K −1 K −1 K −1 k=1 ηk = k=1 η0 2 + η0 2 = η0 (2 − 1) + η0 2 ≤ ∑K 154 η0−1 2K+1 . Furthermore, from bK , we know that ηK−1 ∑bK 2 t=bK−1 ∥∇ft+1 (zt ) − ∇ft (zt )∥2 ≥ −1 since the b is the first iteration within epoch K − 1 which violates the invariL2 ηK−1 K ant. Also, from the monotonicity of gradual variation one can obtain that ηK−1 EGVT,2 ≥ ∑b K −1 which indicates η −1 ≤ √EGV ∥∇ft+1 (zt ) − ∇ft (zt )∥22 ≥ L2 ηK−1 ηK−1 t=b T,2 /L. K−1 K−1 Putting these together, from (5.20) we obtain: T ∑ t=1 ft (wt ) − min w∈W T ∑ t=1 ft (w) ≤ 2L √ 2K+1 + KL2 ≤ 8 EGVT,2 + KL2 . η0 (5.20) It remains to bound the number of epochs in terms of EGVT,2 . A simple idea would be to set K to be ⌊log EGVT,2 ⌋ + 1, since it is the maximum number of epochs that could exists. Alternatively, we can also bound K in terms of √ EGVT,2 which worsen the constant factor in the bound but results in a bound similar to one obtained by setting optimal η. 5.5 Bandit Online Mirror Prox with Gradual Variation Bounds Online convex optimization becomes more challenging when the learner only receives partial feedback about the cost functions. One common scenario of partial feedback is that the learner only receives the cost ft (wt ) at the predicted point wt but without observing the entire cost function ft (·). 
This setup is usually referred to as bandit setting, and the related online learning problem is called online bandit convex optimization. Recently Hazan et al [70] extended the FTRL algorithm to online bandit linear optimiza√ tion and obtained a variation-based regret bound in the form of O(poly(d) VariationT log(T )+ 155 poly(d log(T ))), where VariationT is the total variation of the cost vectors. We continue this line of work by proposing algorithms for general online bandit convex optimization with a variation-based regret bound. We present a deterministic algorithm for online bandit convex optimization by extending the OPM algorithm to a multi-point bandit setting, and prove the variation-based regret bound, which is optimal when the variation is independent of the number of trials. In our bandit setting , we assume we are allowed to query d + 1 points around the decision point wt . To develop a variation bound for online bandit convex optimization, we follow [5] by considering the multi-point bandit setting, where at each trial the player is allowed to query the cost functions at multiple points. We propose a deterministic algorithm to compete against the completely adaptive adversary that can choose the cost function ft (w) with the knowledge of w1 , · · · , wt . To approximate the gradient ∇ft (wt ), we query the cost function to obtain the cost values at ft (wt ), and ft (wt +δei ), i = 1, · · · , d, where ei is the ith standard base in Rd . Then we compute the estimate of the gradient ∇ft (wt ) by 1∑ gt = (ft (wt + δei ) − ft (wt )) ei . δ d (5.21) i=1 It can be shown that [5], under the smoothness assumption in (5.8), √ ∥gt − ∇ft (wt )∥2 ≤ dLδ . 2 (5.22) To prove the regret bound, besides the smoothness assumption of the cost functions, and the boundness assumption about the domain W ⊆ B, we further assume that (i) there exists r ≤ 1 such that rB ⊆ W ⊆ B, and (ii) the cost function themselves are Lipschitz continuous, i.e., there exists a constant G such that 156 Algorithm 6 Deterministic Online Bandit Convex Optimization 1: Input: η, α, δ > 0 2: Initialize:: z0 = 0 and f0 (w) = 0 3: for t = 1, . . . , T do 4: Compute wt by { } G 2 wt = arg min ⟨w, gt−1 ⟩ + ∥w − zt−1 ∥2 2η w∈(1−α)W 5: 6: Observe ft (wt ), ft (wt + δei ), i = 1, · · · , d Update zt by { G zt = arg min ⟨w, gt ⟩ + ∥w − zt−1 ∥22 2η w∈(1−α)W } 7: end for |ft (w) − ft (w′ )| ≤ G∥w − w′ ∥2 , ∀w, w′ ∈ W, ∀t. (5.23) For our purpose, we define another gradual variation of cost functions by EGVcT = T∑ −1 t=0 max |ft+1 (w) − ft (w)|. w∈W (5.24) Unlike the gradual variation defined in (5.9) that uses the gradient of the cost functions, the gradual variation in (5.24) is defined according to the values of cost functions. The reason why we bound the regret by the gradual variation defined in (5.24) by the values of the cost functions rather than the one defined in (5.9) by the gradient of the cost functions is that in the bandit setting, we only have point evaluations of the cost functions. The following theorem states the regret bound for Algorithm 6. Theorem 5.13. Let ft (·), t = 1, . . . , T be a sequence of G-Lipschitz √ continuous convex func√ √ 4d max( 2G, EGVcT ) √ , tions, and their gradients are L-Lipschitz continuous. By setting δ = ( dL + G(1 + 1/r))T { } δ 1 G η = min √ , √ , and α = δ/r, we have the following regret bound for Algo4d EGVcT 2 157 rithm 6 T ∑ ft (wt ) − min T ∑ w∈W t=1 √ ft (w) ≤ 4 max √ (√ ) 2G, EGVcT d (dL + G/r) T. (5.25) t=1 Remark 5.14. 
Similar to the regret bound in [5](Theorem 9), Algorithm 6 also gives the √ optimal regret bound O( T ) when the variation is independent of the number of trials. Our regret bound has a better dependence on d (i.e., d) compared with the regret bound in [5] (i.e., d2 ). 5.6 5.6.1 Proofs of Gradual Variation Proof of Theorem 5.3 To prove Theorem 5.3, we first present the following lemma. Lemma 5.15. Let f1 , f2 , . . . , fT be a sequence of convex functions with L-Lipschitz continuous gradients. By running Algorithm 3 over T trials, we have T ∑ t=1  L ft (wt ) ≤ min  ∥w∥22 + w∈W 2η T ∑  ft (zt−1 ) + ⟨w − zt−1 , ∇ft (zt−1 )⟩ t=1 T −1 η ∑ + ∥∇ft+1 (zt ) − ∇ft (zt )∥22 . 2L t=0 With this lemma, we can easily prove Theorem 5.3 by exploring the convexity of ft (w). Proof of Theorem 5.3. By using ∥w∥2 ≤ 1, ∀w ∈ W ⊆ B, and the convexity of ft (w), we 158 have  L min ∥w∥22 + w∈W  2η T ∑ t=1   ∑ L ft (zt−1 ) + ⟨w − zt−1 , ∇ft (zt−1 )⟩ ≤ ft (w). + min  2η w∈W T t=1 Combining the above result with Lemma 5.15, we have T ∑ ft (wt ) − min w∈W t=1 By choosing η = min(1, L/ T ∑ ft (w) ≤ t=1 η L + EGVT,2 . 2η 2L √ EGVT,2 ), we have the regret bound claimed in Theorem 5.3. The Lemma 5.15 is proved by induction. The key to the proof is that zt is the optimal solution to the strongly convex minimization problem in Lemma 5.15, i.e., [ ∑ L zt = arg min ∥w∥22 + fτ (zτ −1 ) + ⟨w − zτ −1 , ∇fτ (zτ −1 )⟩ w∈W 2η t ] τ =1 Proof of Lemma 5.15. We prove the inequality by induction. When T = 1, we have w1 = z0 = 0 and [ ] L η 2 min ∥w∥2 + f1 (z0 ) + ⟨w − z0 , ∇f1 (z0 )⟩ + ∥∇f1 (z0 )∥22 2L w∈W 2η { } L η 2 2 ≥ f1 (z0 ) + ∥∇f1 (z0 )∥2 + min ∥w∥2 + ⟨w − z0 , ∇f1 (z0 )⟩ w 2L 2η = f1 (z0 ) = f1 (w1 ). where the inequality follows that by relaxing the minimization domain w ∈ W to the whole space. We assume the inequality holds for t and aim to prove it for t + 1. To this end, we define 159 [ ] t ∑ L ψt (w) = ∥w∥22 + fτ (zτ −1 ) + ⟨w − zτ −1 , ∇fτ (zτ −1 )⟩ 2η τ =1 + η 2L t−1 ∑ ∥∇fτ +1 (zτ ) − ∇fτ (zτ )∥22 . τ =0 According to the updating procedure for zt in step 6, we have zt = arg minw∈W ψt (w). Define ϕt = ψt (zt ) = minw∈W ψt (w). Since ψt (w) is a (L/η)-strongly convex function, we have L ∥w − zt ∥22 + ⟨w − zt , ∇ψt+1 (zt )⟩ 2η L = ∥w − zt ∥22 + ⟨w − zt , ∇ψt (zt ) + ∇ft+1 (zt )⟩. 2η ψt+1 (w) − ψt+1 (zt ) ≥ Setting w = zt+1 = arg minw∈W ψt+1 (w) in the above inequality results in ψt+1 (zt+1 ) − ψt+1 (zt ) = ϕt+1 − (ϕt + ft+1 (zt ) + η ∥∇ft+1 (zt ) − ∇ft (zt )∥22 ) 2L L ∥z − zt ∥22 + ⟨zt+1 − zt , ∇ψt (zt ) + ∇ft+1 (zt )⟩ 2η t+1 L ≥ ∥zt+1 − zt ∥22 + ⟨zt+1 − zt , ∇ft+1 (zt )⟩, 2η ≥ (5.26) where the second inequality follows from the fact zt = arg minw∈W ψt (w), and therefore (w − zt )⊤ ∇ψt (zt ) ≥ 0, ∀w ∈ W. Moving ft+1 (zt ) in the above inequality to the right hand side, we have 160 ϕt+1 − ϕt − η ∥∇ft+1 (zt ) − ∇ft (zt )∥22 2L L ∥z − zt ∥22 + ⟨zt+1 − zt , ∇ft+1 (zt )⟩ + ft+1 (zt ) 2η t+1 { } L 2 ≥ min ∥w − zt ∥2 + ⟨w − zt , ∇ft+1 (zt )⟩ + ft+1 (zt ) w∈W 2η         L  2 = min ∥w − zt ∥2 + ⟨w − zt , ∇ft (zt )⟩ +ft+1 (zt ) + ⟨w − zt , ∇ft+1 (zt ) − ∇ft (zt )⟩ .  2η w∈W        r(w) ≥ ρ(w) (5.27) To bound the right hand side, we note that wt+1 is the minimizer of ρ(w) by step 4 in Algorithm 3, and ρ(w) is a L/η-strongly convex function, so we have ρ(w) ≥ ρ(wt+1 ) + ⟨w − wt+1 , ∇ρ(wt+1 )⟩ + ≥0 L L ∥w − wt+1 ∥22 ≥ ρ(wt+1 ) + ∥w − wt+1 ∥22 . 
2η 2η Then we have L ∥w − wt+1 ∥22 + r(w) 2η L L = ∥wt+1 − zt ∥22 + ⟨wt+1 − zt , ∇ft (zt )⟩ +ft+1 (zt ) + ∥w − wt+1 ∥22 + r(w) 2η 2η ρ(w) + ft+1 (zt ) + r(w) ≥ ρ(wt+1 ) + ft+1 (zt ) + ρ(wt+1 ) Plugging above inequality into the inequality in (5.27), we have 161 ϕt+1 − ϕt − ≥ η ∥∇ft+1 (zt ) − ∇ft (zt )∥22 2L L ∥w − zt ∥22 + ⟨wt+1 − zt , ∇ft (zt ) + ft+1 (zt )⟩ 2η t+1 } { L 2 ∥w − wt+1 ∥2 + ⟨w − zt , ∇ft+1 (zt ) − ∇ft (zt )⟩ + min w∈W 2η To continue the bounding, we proceed as follows ϕt+1 − ϕt − ≥ = ≥ η ∥∇ft+1 (zt ) − ∇ft (zt )∥22 2L L ∥w − zt ∥22 + ⟨wt+1 − zt , ∇ft (zt )⟩ + ft+1 (zt ) 2η t+1 { } L 2 + min ∥w − wt+1 ∥2 + ⟨w − zt , ∇ft+1 (zt ) − ∇ft (zt )⟩ w∈W 2η L ∥w − zt ∥22 + ⟨wt+1 − zt , ∇ft+1 (zt )⟩ + ft+1 (zt ) 2η t+1 } { L 2 ∥w − wt+1 ∥2 + ⟨w − wt+1 , ∇ft+1 (zt ) − ∇ft (zt )⟩ + min w∈W 2η L ∥w − zt ∥22 + ⟨wt+1 − zt , ∇ft+1 (zt )⟩ + ft+1 (zt ) 2η t+1 } { L 2 ∥w − wt+1 ∥2 + ⟨w − wt+1 , ∇ft+1 (zt ) − ∇ft (zt )⟩ + min w 2η L η ∥wt+1 − zt ∥22 + ⟨wt+1 − zt , ∇ft+1 (zt )⟩ + ft+1 (zt ) − ∥∇ft+1 (zt ) − ∇ft (zt )∥22 2η 2L η ≥ft+1 (wt+1 ) − ∥∇ft+1 (zt ) − ∇ft (zt )∥22 , 2L = where the first equality follows by writing ⟨wt+1 − zt , ∇ft (zt )⟩ = ⟨wt+1 − zt , ∇ft+1 (zt )⟩− ⟨wt+1 − zt , ∇ft+1 (zt ) − ∇ft (zt )⟩ and combining with ⟨w − zt , ∇ft+1 (zt ) − ∇ft (zt )⟩, and the last inequality follows from the smoothness condition of ft+1 (w). Since by induction ϕt ≥ ∑t τ =1 fτ (wτ ), we have ϕt+1 ≥ ∑t+1 τ =1 fτ (wτ ). 162 5.6.2 Proof of Theorem 5.5 To prove Theorem 5.5, we need the following lemma, which is the Lemma 3.1 in [117] stated in our notations. Lemma 5.16 (Lemma 3.1 [117]). Let Φ(z) be a α-strongly convex function with respect to the norm ∥ · ∥, whose dual norm is denoted by ∥ · ∥∗ , and B(w, z) = Φ(w) − (Φ(z) + (w − z)⊤ Φ′ (z)) be the Bregman distance induced by function Φ(w). Let Z be a convex compact set, and U ⊆ Z be convex and closed. Let z ∈ Z, γ > 0, Consider the points, w = arg min γu⊤ ξ + B(u, z), (5.28) z+ = arg min γu⊤ ζ + B(u, z), (5.29) u∈U u∈U then for any u ∈ U , we have γζ ⊤ (w − u) ≤ B(u, z) − B(u, z+ ) + γ2 α ∥ξ − ζ∥2∗ − [∥w − z∥2 + ∥w − z+ ∥2 ]. α 2 In order not to have readers struggle with complex notations in [117] for the proof of Lemma 5.16, we present a detailed proof later in Appendix A.3 which is an adaption of the original proof to our notations. Theorem 5.5 can be proved by using the above lemma, because the updates of wt , zt can be written equivalently as (5.28). The proof below starts from (5.16) and bounds the summation of each term over t = 1, . . . , T , respectively. Proof of Theorem 5.5. First, we note that the two updates in step 4 and step 6 of Algorithm 4 fit in the Lemma 5.16 if we let U = Z = W, z = zt−1 , w = wt , z+ = zt , and Φ(w) = 21 ∥w∥22 , 163 which is 1-strongly convex function with respect to ∥ · ∥2 . Then B(u, z) = 12 ∥u − z∥22 . As a result, the two updates for wt , zt in Algorithm 4 are exactly the updates in (5.28) with z = zt−1 , γ = η/L, ξ = ∇ft−1 (zt−1 ), and ζ = ∇ft (wt ). 
Replacing these into (5.16), we have the following inequality for any u ∈ W, ) η 1( ⊤ 2 2 (wt − u) ∇ft (wt ) ≤ ∥u − zt−1 ∥2 − ∥u − zt ∥2 L 2 ) 1( η2 ∥wt − zt−1 ∥22 + ∥wt − zt ∥22 + 2 ∥∇ft (wt ) − ∇ft−1 (wt−1 )∥22 − 2 L (5.30) Then we have ) η η 1( 2 2 ⊤ ∥u − z ∥ − ∥u − z ∥ (ft (wt ) − ft (u)) ≤ (wt − u) ∇ft (wt ) ≤ t 2 t−1 2 L L 2 2η 2 2η 2 + 2 ∥∇ft (wt−1 ) − ∇ft−1 (wt−1 )∥22 + 2 ∥∇ft (wt ) − ∇ft (wt−1 )∥22 L L ) ( 1 − ∥wt − zt−1 ∥22 + ∥wt − zt ∥22 2 ) 2η 2 1( 2 2 ≤ ∥u − zt−1 ∥2 − ∥u − zt ∥2 + 2 ∥∇ft (wt−1 ) − ∇ft−1 (wt−1 )∥22 2 L ) ( 1 2 2 2 2 + 2η ∥wt − wt−1 ∥2 − ∥wt − zt−1 ∥2 + ∥wt − zt ∥2 , 2 (5.31) where the first inequality follows the convexity of ft (w), and the third inequality follows the smoothness of ft (w). By taking the summation over t = 1, · · · , T with z∗ = arg min u∈W 164 ∑T t=1 ft (u), and dividing both sides by η/L, we have T ∑ ft (wt ) − min w∈W t=1 T −1 2η ∑ L ft (w) ≤ + ∥∇ft+1 (wt ) − ∇ft (wt )∥22 2η L ∑ t=1 T ∑ + t=0 2η 2 ∥wt − wt−1 ∥22 − t=1 T ∑ 1( t=1 2 ∥wt − zt−1 ∥22 + ∥wt − zt ∥22 ) ≜BT We can bound BT as follows: BT = ≥ T T +1 1∑ 1∑ ∥wt − zt−1 ∥22 + ∥wt−1 − zt−1 ∥22 2 2 1 2 1 ≥ 4 t=1 T ( ∑ t=2 T ∑ t=2 t=2 ∥wt − zt−1 ∥22 + ∥wt−1 − zt−1 ∥22 1 ∥wt − wt−1 ∥22 = 4 T ∑ ) ∥wt − wt−1 ∥22 t=1 where the last equality follows that w1 = w0 . Plugging the above bound into (5.32), we have T ∑ t=1 ft (wt ) − min w∈W ∑ ft (w) ≤ t=1 T ( ∑ 2η 2 − + t=1 T −1 L 2η ∑ + ∥∇ft+1 (wt ) − ∇ft (wt )∥22 2η L 1 4 t=0 ) ∥wt − wt−1 ∥22 We complete the proof by plugging the value of η. 5.6.3 Proof of Corollary 5.12 We first have the key inequality in (5.19): for any z ∈ W 165 ft (wt ) − ft (z) ≤ Bt (z, zt−1 ) − Bt (z, zt ) − β∥wt − z∥2M t ] [ 1 + ∥∇ft (wt ) − ∇ft−1 (wt−1 )∥2 −1 − ∥wt − zt−1 ∥2H + ∥wt − zt ∥2H . t t Ht 2 Taking summation over t = 1, . . . , T , we have T ∑ ft (wt ) − t=1 T ∑ ft (z) ≤ t=1 + T ∑ t=1 T ∑ (Bt (z, zt−1 ) − Bt (z, zt )) − t=1 T ∑ t=1 ≜At ∥∇ft (wt ) − ∇ft−1 (wt−1 )∥2 −1 − H T ∑ 1[ t t=1 2 β∥wt − z∥2M t ≜Ct ∥wt − zt−1 ∥2H + ∥wt − zt ∥2H t ] t ≜Bt ≜St Next we bound each term individually. First, T ∑ At = B1 (z, z0 ) − BT (z, zT ) + T∑ −1 (Bt+1 (z, zt ) − Bt (z, zt )) (5.32) t=1 t=1 Note that B1 (z, z0 ) = 21 (1 + βG2 )∥z∥22 ≤ 12 (1 + βG2 ) for any z ∈ W, and Bt+1 (z, zt ) − Bt (w, zt ) = β2 ∥z − zt ∥2H , therefore t T ∑ t=1 ∑β 1 At ≤ (1 + βG2 ) + ∥z − zt ∥2H t 2 2 T t=1 166 (5.33) Then T ∑ t=1 ] T [ ∑ 1 β 2 2 2 (At − Ct ) ≤ (1 + βG ) + ∥z − zt ∥M − β∥wt − z∥M t t 2 2 1 ≤ (1 + βG2 ) + 2 1 ≤ (1 + βG2 ) + 2 t=1 T [ ∑ t=1 T ∑ t=1 ] β∥z − wt ∥2M + β∥wt − zt ∥2M − β∥wt − z∥2M t t 1 β∥wt − zt ∥2M ≤ (1 + βG2 ) + t 2 t T ∑ ∥wt − zt ∥2H t=1 t (5.34) Noting the updates in (5.18) and from inequality in (A.9) in the proof of Lemma 5.5 in Appendix A.3, we can get ∥wt − zt ∥Ht ≤ ∥∇ft (wt ) − ∇ft−1 (wt−1 )∥ −1 H t Next, we bound T ∑ t=1 ∑T t=1 Bt . T −1 T 1∑ 1∑ 2 ∥wt+1 − zt ∥H + ∥wt − zt ∥2H Bt = t t+1 2 2 ≥ 1 2 1 ≥ 4 ≥ 1 4 t=0 T∑ −1 t=1 T∑ −1 t=1 T∑ −1 ∥wt+1 − zt ∥2H + t 1 2 t=1 T∑ −1 ∥wt − zt ∥2H t=1 T∑ −1 1 ∥wt+1 − wt ∥2H ≥ t 4 t=0 t (5.35) 1 ∥wt+1 − wt ∥22 − ∥w1 − w0 ∥22 4 ∥wt+1 − wt ∥22 , t=0 where the last inequality follows that w0 = w1 = 0. Therefore, 167 T ∑ ft (wt ) − t=1 T ∑ t=1 ∑ 1 ft (z) ≤ (1 + βG2 ) + 2 ∥∇ft (wt ) − ∇ft−1 (wt−1 )∥ −1 Ht 2 T t=1 − 1 4 T∑ −1 (5.36) ∥wt+1 − wt ∥22 t=0 To proceed, we need the following lemma. Lemma 5.17. 
We have T ∑ t=1  T ∑ β 4d  ∥∇ft (wt ) − ∇ft−1 (wt−1 )∥ −1 ≤ ln 1 + ∥∇ft (wt ) − ∇ft−1 (wt−1 )∥22  Ht β 4  t=1 (5.37) Thus,  T ∑ 4d  β ∥∇ft (wt ) − ∇ft−1 (wt−1 )∥ −1 ≤ ∥∇ft (wt ) − ∇ft−1 (wt−1 )∥22  ln 1 + Ht β 4 t=1 t=1   T ∑ 4d  β ≤ ln 1 + ∥∇ft (wt ) − ∇ft (wt−1 ) + ∇ft (wt−1 ) − ∇ft−1 (wt−1 )∥22  β 4 t=1   T T β∑ 4d  β∑ 2 ≤ ln 1 + L ∥wt − wt−1 ∥22 + ∥∇ft (wt−1 ) − ∇ft−1 (wt−1 )∥22  β 2 2 t=1 t=1   T∑ −1 T∑ −1 4d  β β ≤ ln 1 + L2 ∥wt+1 − wt ∥22 + ∥∇ft+1 (wt ) − ∇ft (wt )∥22  β 2 2 t=0 t=0   T∑ −1 4d  β β ≤ ln 1 + L2 ∥wt+1 − wt ∥22 + EGVT,2  β 2 2  T ∑ t=0 (5.38) 168 Then, T ∑ ft (wt ) − t=1 T ∑ t=1  8d  β 1 ft (z) ≤ (1 + βG2 ) + ln 1 + 2 β 2 − T∑ −1  L2 ∥wt+1 − wt ∥22 + t=0 β EGVT,2  2 T −1 1∑ ∥wt+1 − wt ∥22 4 t=0 (5.39) Without loss of generality we assume 8dL2 ≥ 1. Next, let us consider two cases. In the first case, we assume βEGVT,2 ≤ 16dL2 . Then   T∑ −1 8d  β β ln 1 + L2 ∥wt+1 − wt ∥22 + EGVT,2  β 2 2 t=0   T∑ −1 β 8d  L2 ∥wt+1 − wt ∥22 + 8dL2  ln 1 + ≤ β 2 t=0   T∑ −1 8d  β ≤ ln L2 ∥wt+1 − wt ∥22 + 16dL2  β 2 t=0 [ ( ∑ )] β T −1 2 2 L ∥w − w ∥ 8d t t+1 2 +1 = ln 16dL2 + ln 2 t=0 β 16dL2 ∑T −1 β ∑ −1 2 2 L ∥wt+1 − wt ∥22 8d 8d 8d 2 Tt=0 2 2 t=0 ∥wt+1 − wt ∥2 = ≤ ln 16dL + ln 16dL + β β β 4 16dL2 (5.40) where the last inequality follows ln(1 + x) ≤ x for x ≥ 0. Then we get T ∑ t=1 ft (wt ) − T ∑ t=1 8d 1 ln 16dL2 ft (z) ≤ (1 + βG2 ) + 2 β In the second case, we assume βEGVT,2 ≥ 16dL2 ≥ 2, then we have 169 (5.41)   T∑ −1 8d  β β ln 1 + L2 ∥wt+1 − wt ∥22 + EGVT,2  β 2 2 t=0   T∑ −1 8d  β ln ≤ L2 ∥wt+1 − wt ∥22 + βEGVT,2  β 2 t=0 [ ( ∑ )] β T −1 2 2 L ∥w − w ∥ 8d t 2 t+1 ≤ ln(βEGVT,2 ) + ln 2 t=0 +1 β βEGVT,2 β ∑ −1 2 L ∥wt+1 − wt ∥22 8d 8d 2 Tt=0 = ln(βEGVT,2 ) + β β βEGVT,2 ∑ −1 4dL2 Tt=0 ∥wt+1 − wt ∥22 8d = ln(βEGVT,2 ) + β βEGVT,2 ∑T −1 ∥wt+1 − wt ∥22 8d ln(βEGVT,2 ) + t=0 ≤ β 4 (5.42) where the last inequality follows βEGVT,2 ≥ 16dL2 . Then we get T ∑ ft (wt ) − T ∑ t=1 t=1 1 8d ft (z) ≤ (1 + βG2 ) + ln(βEGVT,2 ) 2 β (5.43) Thus, we complete the proof by combining the two cases. Next, we prove Lemma 5.17. We need the following lemma, which can be proved by using Lemma 6 [68] and noting that |I + ∑t ∑T ⊤ 2 d τ =1 uτ uτ | ≤ (1 + t=1 ∥ut ∥2 ) , where | · | denotes the determinant of a matrix. Lemma 5.18. Let u1 , u2 , · · · , uT ∈ Rd be a sequence of vectors. Let Vt = I + ∑t ⊤ τ =1 uτ uτ . Then, T ∑  −1  u⊤ t Vt ut ≤ d ln 1 + T ∑  ∥ut ∥22  (5.44) t=1 t=1 To prove Lemma 5.17, we let vt = ∇ft (wt ), t = 1, . . . , T and v0 = 0. Then Ht = 170 I + βG2 I + β ∑t−1 ⊤ τ =0 vτ vτ . Note that we assume ∥∇ft (w)∥2 ≤ G, therefore t β∑ ⊤ vτ vτ ≥ I + (vτ vτ⊤ + vτ −1 vτ⊤−1 ) Ht ≥ I + β 2 τ =1 τ =1 t β∑ ≥I+ (vτ − vτ −1 )(vτ − vτ −1 )⊤ = Vt 4 τ =1 t ∑ (5.45) √ ∑ Let ut = ( β/2)(vt − vt−1 ), then Vt = I + tτ =1 uτ u⊤ τ . By applying the above lemma, we have   T T ∑ ∑ β β (vt − vt−1 )⊤ Vt−1 (vt − vt−1 ) ≤ d ln 1 + ∥vt − vt−1 ∥22  4 4 t=1 (5.46) t=1 Thus, T ∑ T ∑ (vτ − vτ −1 )⊤ Vt−1 (vτ − vτ −1 ) (vτ − vτ −1 )⊤ H−1 t (vτ − vτ −1 ) ≤ t=1 t=1   T ∑ β 4d  ln 1 + ∥vt − vt−1 ∥22  ≤ β 4 (5.47) t=1 5.6.4 Proof of Theorem 5.13 Let ht (w) = ft (w) + ⟨gt − ∇ft (wt ), w⟩. It is easy seen that ∇ht (wt ) = gt . Followed by Lemma 5.16, we have for any z ∈ (1 − α)W, ) η2 η 1( ⊤ 2 2 ∇ht (wt ) (wt − z) ≤ ∥z − zt−1 ∥2 − ∥z − zt ∥2 + 2 ∥gt − gt−1 ∥22 G 2 G ( ) 1 ∥wt − zt−1 ∥22 + ∥wt − zt ∥22 − 2 Taking summation over t = 1, . . . 
, T , we have, 171 (5.48) T T ∑ ∥z − z0 ∥22 ∑ η 2 η ⊤ ∥gt − gt−1 ∥22 ∇ht (wt ) (wt − z) ≤ + G 2 G2 t=1 − t=1 T ∑ t=1 ) 1( ∥wt − zt−1 ∥22 + ∥wt − zt ∥22 2 ∑1 ∥z − z0 ∥22 ∑ η 2 2− ≤ + ∥g − g ∥ ∥wt − wt−1 ∥22 t t−1 2 2 2 4 G ≤ ≤ T T t=1 t=1 ∑1 1 ∑ η2 2− + ∥g − g ∥ ∥wt − wt−1 ∥22 t t−1 2 2 2 4 G 1 + 2 − T T t=1 T ∑ t=1 T ∑ t=1 T ∑ t=1 2η 2 ∥gt − gt−1 ∥22 + G2 t=1 2η 2 ∥gt − gt−1 ∥22 G2 1 ∥wt − wt−1 ∥22 4 T d 1 2η 2 ∑ ∑ ≤ + 2 2 (ft (wt + δei ) − ft (wt−1 + δei ))ei − (ft (wt ) − ft (wt−1 ))ei 2 δ G t=1 i=1 2 2 T d 2η 2 ∑ ∑ + 2 2 (ft (wt−1 + δei ) − ft−1 (wt−1 + δei ))ei − (ft (wt−1 ) − ft−1 (wt−1 ))ei δ G t=1 i=1 − T ∑ t=1 2 2 1 ∥wt − wt−1 ∥22 , 4 (5.49) where the second inequality follows (5.32). Next, we bound the middle two terms in right hand side of the above inequality. 172 T d ∑ ∑ 2 (ft (wt + δei ) − ft (wt−1 + δei ))ei − (ft (wt ) − ft (wt−1 ))ei t=1 i=1 ≤ ≤ T ∑ t=1 T ∑ 2d 2 d ( ∑ |ft (wt + δei ) − ft (wt−1 + δei )|2 + |ft (wt ) − ft (wt−1 )|2 ) (5.50) i=1 4d2 G2 ∥wt − wt−1 ∥22 , t=1 and d T ∑ ∑ 2 (ft (wt−1 + δei ) − ft−1 (wt−1 + δei ))ei − (ft (wt−1 ) − ft−1 (wt−1 ))ei t=1 i=1 ≤ ≤ T ∑ t=1 T ∑ t=1 2d 2 d ( ∑ |ft (wt−1 + δei ) − ft−1 (wt−1 + δei )|2 + |ft (wt−1 ) − ft−1 (wt−1 )|2 ) (5.51) i=1 4d2 max |ft (w) − ft−1 (w)|2 . w∈W Then we have T T ∑ η 1 8d2 η 2 ∑ ∇ht (wt )⊤ (wt − z) ≤ + ∥wt − wt−1 ∥22 G 2 δ2 t=1 t=1 T T ∑ 8d2 η 2 ∑ 1 2 max |ft (w) − ft−1 (w)| − + 2 2 ∥wt − wt−1 ∥22 4 δ G w∈W t=1 t=1 T 1 8d2 η 2 ∑ max |ft (w) − ft−1 (w)|2 ≤ + 2 2 2 δ G w∈W t=1 (5.52) √ where the last inequality follows that η ≤ δ/(4 2d). Then by using the convexity of ht (w) and dividing both sides by η/G, we have 173 T ∑ ht (wt ) − min ht ((1 − α)w) ≤ w∈W t=1 √ ) (√ 8ηd2 G c ≤ 4d max c + EVAR 2G, EVAR T T 2η δ Gδ 2 (5.53) Following the the proof of Theorem 8 in [5], we have T ∑ ft (wt ) − t=1 ≤ ≤ T ∑ ft (w) ≤ t=1 T ∑ t=1 T ∑ ht (wt ) − ht (wt ) − t=1 T ∑ t=1 T ∑ T ∑ ht (wt ) − t=1 ht (w) + T ∑ ht (w) + t=1 T ∑ T ∑ ft (wt ) − ht (wt ) − ft (w) + ht (w) t=1 ⟨gt − ∇ft (wt ), w − wt ⟩ t=1 ht (w) + √ dLδT t=1 (5.54) where the last inequality follows from the following facts: √ dLδ ∥gt − ∇ft (wt )]∥2 ≤ 2 (5.55) ∥w − wt ∥ ≤ 2 Then we have T ∑ t=1 ft (wt ) − min w∈W T ∑ t=1 √ (√ ) √ 4d c ft ((1 − α)w) ≤ max 2G, EVART + dLδT δ (5.56) By the Lipschitz continuity of ft (w), we have T ∑ t=1 ft ((1 − α)w) ≤ T ∑ t=1 174 ft (w) + GαT (5.57) The we get T ∑ ft (wt ) − min w∈W t=1 T ∑ t=1 ft (w) ≤ √ ) (√ √ 4d max 2G, EVARcT + δ dLT + αGT δ (5.58) Plugging the stated values of δ and α completes the proof. 5.7 Summary In this chapter, we proposed two novel algorithms for online convex optimization that bound the regret by the gradual variation of consecutive cost functions. The first algorithm is an improvement of the FTRL algorithm, and the second algorithm is based on the mirror prox method. Both algorithms maintain two sequence of solution points, a sequence of decision points and a sequence of searching points, and share the same order of regret bound up to a constant. The online mirror prox method only requires to keep tracking of a single gradient of each cost function, while the improved FTRL algorithm needs to evaluate the gradient of each cost function at two points and maintain a sum of up-to-date gradients of the cost functions. We note that a very recent work Chiang et al. 
[40] extends the prox method to a two-point bandit setting and achieves regret bounds in expectation similar to those in the full-information setting, i.e., O(d² √(EGVT,2 ln T)) for smooth functions and O(d² ln(EGVT,2 + ln T)) for smooth and strongly convex cost functions, where EGVT,2 is the gradual variation defined on the gradients of the cost functions. We would like to make a thought-provoking comparison between our regret bound and theirs for online bandit convex optimization with smooth cost functions. First, the gradual variation in our bandit setting is defined on the values of the cost functions, in contrast to theirs, which is defined on the gradients of the cost functions. Second, we query the cost function at d + 1 points, in contrast to 2 points in their algorithms, and as a tradeoff our regret bound has a better dependence on the dimension (i.e., O(d)) than theirs (i.e., O(d²)). Third, our regret bound carries an extra factor of √T, compared with the ln T factor in theirs. Therefore, some open problems are how to achieve a dependence on d lower than d² in the two-point bandit setting, and how to remove the √T factor while keeping a small dependence on d in our multi-point bandit setting; studying the two different types of gradual variation in bandit settings is also left as future work.

5.8 Bibliographic Notes

A wide range of literature deals with the online decision making problem, and a number of regret-minimizing algorithms achieve the optimal regret bound. The first distribution-free framework for sequential decision making was proposed by Hannan [65] and was rediscovered in [86]. Blackwell, in his seminal paper [23], generalized Hannan's result to the problem of playing a repeated game with a vector-valued payoff function and gave a precise necessary and sufficient condition for when a set is approachable. The most well-known and successful work is probably the Hedge algorithm [58], a direct generalization of Littlestone and Warmuth's Weighted Majority (WM) algorithm [99]. Another algorithm for the online decision making problem is Vovk's aggregating strategies [147]. Other recent studies include improved theoretical bounds and the parameter-free hedging algorithm [39] and adaptive Hedge [53] for decision-theoretic online learning. We refer readers to [38] for an in-depth discussion of this subject.

As discussed in Chapter 2, over the past decade many algorithms have been proposed for online convex optimization, especially for online linear optimization. In the first seminal paper on online convex optimization, Zinkevich [158] proposed a gradient descent algorithm with a regret bound of O(√T). When the cost functions are strongly convex, the regret bound of online gradient descent is reduced to O(log T) with an appropriately chosen step size [68]. Another common methodology for online convex optimization, especially for online linear optimization, is based on the framework of Follow the Leader (FTL), which chooses wt by minimizing the cumulative cost incurred over all previous trials. Since the naive FTL algorithm fails to achieve sublinear regret in the worst case, many variants have been developed to fix this problem, including Follow The Perturbed Leader (FTPL) [86], Follow The Regularized Leader (FTRL) [3], and Follow The Approximate Leader (FTAL) [68].
Other methodologies for online convex optimization introduce a potential function (or link function) to map solutions between the space of primal variables and the space of dual variables, and carry out primal-dual updates based on the potential function. The well-known Exponentiated Gradient (EG) algorithm [89] and the Multiplicative Weights algorithm [99, 58] belong to this category. We note that these different algorithms are closely related; for example, in online linear optimization, the potential-based primal-dual algorithm is equivalent to the FTRL algorithm.

Chapter 6 Gradual Variation for Composite Losses

This chapter continues our investigation of online learning methods that lead to better regret bounds in gradually evolving environments. The results obtained in Chapter 5 rely on the assumption that the cost functions are smooth. We also showed that for general non-smooth functions, when the learner is only provided with first order information about the cost functions, it is impossible to obtain a regret bounded by the gradual variation. In this chapter, however, we show that a gradual variation bound is achievable for a special class of non-smooth functions that are composed of a smooth component and a non-smooth component.

We consider two categories for the non-smooth component. In the first category, we assume that the non-smooth component is a fixed function that is simple enough for the composite gradient mapping to be solved with little computational overhead compared to the plain gradient mapping. A common example in this category is a non-smooth regularizer. For instance, in addition to the basic domain W, one may wish to enforce a sparsity constraint on the decision vector w, i.e., ∥w∥0 ≤ k < d, which is important in feature selection. Since the sparsity constraint ∥w∥0 ≤ k is non-convex, it is usually implemented by adding an ℓ1 regularizer λ∥w∥1 to the objective function, where λ > 0 is a regularization parameter. Therefore, at each iteration the cost function is given by ft(w) + λ∥w∥1. To prove a regret bound by gradual variation for this type of non-smooth optimization, we first present a simplified version of the general online mirror prox method from Chapter 5 and show that it attains exactly the same regret bound as stated in Chapter 5, and we then extend the algorithm to non-smooth optimization with a fixed non-smooth component.

In the second category, we assume that the non-smooth component can be written with an explicit maximization structure. In general, we consider a time-varying non-smooth component, present a primal-dual prox method, and prove a min-max regret bound by gradual variation. When the non-smooth components are equal across all trials, the usual regret is bounded by the min-max bound plus a variation in the non-smooth component. As an application of the min-max regret bound, we consider the problem of online classification with the hinge loss and show that the number of mistakes can be bounded by a variation in the sequential examples.

Before moving to the detailed analysis, it is worth mentioning that several works have proposed algorithms for optimizing the two types of non-smooth functions described above with an optimal convergence rate of O(1/T) [122, 123]. Therefore, the existence of a regret bound by gradual variation for these two types of non-smooth optimization does not violate the impossibility argument of Section 5.3.
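Before presenting the algorithms, the following sketch illustrates why the first category is computationally benign: for the ℓ1 regularizer g(w) = λ∥w∥1 mentioned above, the composite gradient mapping has a closed-form soft-thresholding solution. This is a standard fact rather than something stated in the text; dropping the domain constraint and the helper names are our simplifications for illustration.

```python
import numpy as np

def soft_threshold(v, tau):
    """Closed-form solution of min_w 0.5*||w - v||_2^2 + tau*||w||_1 (coordinate-wise shrinkage)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def composite_prox_step(z, grad, eta, L, lam):
    """One composite gradient mapping
         argmin_w  <w, grad> + (L / (2*eta)) * ||w - z||_2^2 + lam * ||w||_1,
    with the domain constraint dropped for illustration; completing the square shows it
    reduces to soft-thresholding of the plain gradient step z - (eta/L)*grad."""
    return soft_threshold(z - (eta / L) * grad, lam * eta / L)
```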
6.1 Composite Losses with a Fixed Non-smooth Component

6.1.1 A Simplified Online Mirror Prox Algorithm

In this subsection, we present a simplified version of the online mirror prox (OMP) method proposed in Chapter 5, which serves as the foundation for developing the algorithm for non-smooth optimization. The key trick is to replace the domain constraint w ∈ W with a non-smooth function in the objective. Let δW(w) denote the indicator function of the domain W, i.e., δW(w) = 0 if w ∈ W and δW(w) = ∞ otherwise. Then the proximal gradient mapping for updating w (step 4) in Algorithm 5 is equivalent to

wt = arg min_w ⟨w, ∇ft−1(wt−1)⟩ + (L/η) B(w, zt−1) + δW(w).

By the first order optimality condition, there exists a sub-gradient vt ∈ ∂δW(wt) such that

∇ft−1(wt−1) + (L/η)(∇Φ(wt) − ∇Φ(zt−1)) + vt = 0. (6.1)

Thus, wt is equal to

wt = arg min_w ⟨w, ∇ft−1(wt−1) + vt⟩ + (L/η) B(w, zt−1). (6.2)

Then we can change the update for zt to

zt = arg min_w ⟨w, ∇ft(wt) + vt⟩ + (L/η) B(w, zt−1). (6.3)

The key ingredient of the above update, compared to step 6 in Algorithm 5, is that we explicitly use the sub-gradient vt satisfying the optimality condition for wt instead of solving a domain-constrained optimization problem. The advantage of updating zt by (6.3) is that we can easily compute zt from the first order optimality condition, i.e.,

∇ft(wt) + vt + (L/η)(∇Φ(zt) − ∇Φ(zt−1)) = 0. (6.4)

Note that Eq. (6.1) gives vt = −∇ft−1(wt−1) − (L/η)(∇Φ(wt) − ∇Φ(zt−1)). By plugging this into (6.4), we arrive at the following simplified update for zt:

∇Φ(zt) = ∇Φ(wt) + (η/L)(∇ft−1(wt−1) − ∇ft(wt)).

The simplified version of Algorithm 5 is presented in Algorithm 7.

Remark 6.1. We make three remarks for Algorithm 7. First, the searching point zt does not necessarily belong to the domain W, which is usually not a problem given that the decision point wt is always in W. Nevertheless, the update can be followed by a projection step zt = arg min_{w∈W} B(w, z′t) to ensure that the searching point also stays in the domain W, where we slightly abuse the notation z′t for the point satisfying ∇Φ(z′t) = ∇Φ(wt) + (η/L)(∇ft−1(wt−1) − ∇ft(wt)).
As a result, we can apply the same analysis as in the proof of Theorem 5.5 to obtain the same regret bound in Theorem 5.6 for Algorithm 7. Note that the above inequality remains valid even if we take a projection step after the update for z′t due to the generalized pythagorean inequality B(w, zt ) ≤ B(w, z′t ), ∀w ∈ W [38]. 6.1.2 A Gradual Variation Bound for Online Non-Smooth Optimization In spirit of Algorithm 7, we present an algorithm for online non-smooth optimization of functions ft (w) = ft (w) + g(w) with a regret bound by gradual variation as: EGVT = T∑ −1 ∥∇ft+1 (wt ) − ∇ft (wt )∥2∗ . t=0 The trick is to solve the composite gradient mapping: wt = arg min ⟨w, ∇ft−1 (wt−1 )⟩ + w∈W L B(w, zt−1 ) + g(w) η and update zt by ∇Φ(zt ) = ∇Φ(wt ) + η (∇ft−1 (wt−1 ) − ∇ft (wt )). L 183 Algorithm 8 Online Mirror Prox Method with a Fixed Non-Smooth Component 1: Input: η > 0, Φ(z) 2: Initialization: z0 = w0 = minz∈W Φ(z) and f0 (w) = 0 3: for t = 1, . . . , T do 4: Predict wt by { } L wt = arg min ⟨w, ∇ft−1 (wt−1 )⟩ + B(w, zt−1 ) + g(w) η w∈W Receive cost function ft (·) and incur loss ft (wt ) Update zt by ( ) η ∗ ′ zt = ∇Φ ∇Φ(wt ) + (∇ft−1 (wt−1 ) − ∇ft (wt )) L 5: 6: and zt = minw∈W B(w, z′t ) 7: end for Algorithm 8 shows the detailed steps and Corollay 6.2 states the regret bound, which can be proved similarly. Corollary 6.2. Let ft (w) = ft (w) + g(w), t = 1, . . . , T be a sequence of convex functions where ft (w) are L-smooth continuous w.r.t ∥ · ∥ and g(w) is a non-smooth function, Φ(z) be a α-strongly convex function w.r.t ∥ · ∥, and EGVT be defined in (5.13). By setting {√ √ } √ η = (1/2) min α/ 2, LR/ EGVT , we have the following regret bound T ∑ t=1 6.2 ft (wt ) − min w∈W T ∑ ft (w) ≤ 2R max (√ ) √ √ 2LR/ α, EGVT . t=1 Composite Losses with an Explicit Max Structure In previous subsection, we assume the composite gradient mapping with the non-smooth component can be efficiently solved. Here, we replace this assumption with an explicit max structure of the non-smooth component. In what follows, we present a primal-dual prox method for such non-smooth cost functions 184 and prove its regret bound. We consider a general setting, where the non-smooth functions ft (w) has the following structure: ft (w) = ft (w) + max⟨At w, u⟩ − ϕt (u), (6.5) u∈Q where ft (w) and ϕt (u) are L1 -smooth and L2 -smooth functions, respectively, and At ∈ Rm×d is a matrix used to characterize the non-smooth component of ft (w) with −ϕt (u) by maximization. Similarly, we define a dual cost function ϕt (u) as ϕt (u) = −ϕt (u) + min ⟨At w, u⟩ + ft (w). (6.6) x∈W We refer to w as the primal variable and to u as the dual variable. To motivate the setup, let us consider online classification with hinge loss ℓt (w) = max(0, 1 − yt ⟨w, xt ⟩), where we slightly abuse the notation (xt , yt ) to denote the attribute and label pair received at trial t. It is straightforward to see that ℓt (w) is a non-smooth function and can be cast into the form in (6.5) by ⊤ ℓt (w) = max α(1 − yt x⊤ t w) = max −αyt xt w + α. α∈[0,1] α∈[0,1] To present the algorithm and analyze its regret bound, we introduce some notations. Let Ft (w, u) = ft (w) + ⟨At w, u⟩ − ϕt (u), Φ1 (w) be a α1 -strongly convex function defined on the primal variable w w.r.t a norm ∥ · ∥p and Φ2 (u) be a α2 -strongly convex function defined on the dual variable u w.r.t a norm ∥ · ∥q . Correspondingly, let B1 (w, z) and B2 (u, v) denote the induced Bregman distance, respectively. 
We assume the domains W, Q are bounded 185 and matrices At have a bounded norm, i.e., max ∥w∥p ≤ R1 , w∈W max ∥u∥q ≤ R2 u∈Q max Φ1 (w) − min Φ1 (w) ≤ M1 w∈W w∈W (6.7) max Φ2 (u) − min Φ2 (u) ≤ M2 u∈Q ∥At ∥p,q = u∈Q max ∥w∥p ≤1,∥u∥q ≤1 u⊤ At w ≤ σ. Let ∥ · ∥p,∗ and ∥ · ∥q,∗ denote the dual norms to ∥ · ∥p and ∥ · ∥q , respectively. To prove a variational regret bound, we define a gradual variation as follows: EGVT,p,q = + T∑ −1 t=0 T∑ −1 ∥∇ft+1 (wt ) − ∇ft (wt )∥2p,∗ + (R12 + R22 ) T∑ −1 ∥At − At−1 ∥2p,q t=0 ∥∇ϕt+1 (ut ) − ∇ϕt (ut )∥2q,∗ . t=0 Given above notations, Algorithm 9 shows the detailed steps and Theorem 6.3 states a min-max bound. Theorem 6.3. Let ft (w) = ft (w) + maxu∈Q ⟨At w, u⟩ − ϕt (u), t = 1, . . . , T be a sequence of non-smooth functions. Assume ft (w), ϕ(u) are L = max(L1 , L2 )-smooth functions and the domain W, Q and At satisfy the boundness condition as in (6.7). Let Φ(w) be a α1 -strongly convex function w.r.t the norm ∥·∥p , Φ(u) be a α2 -strongly convex function w.r.t. the norm ∥· ) (√ √ √ √ M1 + M2 /(2 EGVT,p,q ), α/(4 σ 2 + L2 ) ∥q , and α = min(α1 , α2 ). By setting η = min in Algorithm 9, then we have 186 Algorithm 9 Online Mirror Prox Method with an Explicit Max Structure 1: Input: η > 0, Φ1 (z), Φ2 (v) 2: Initialization: z0 = w0 = minz∈W Φ1 (z), v0 = u0 = minv∈Q Φ2 (v) and f0 (w) = ϕ0 (u) = 0 3: for t = 1, . . . , T do 4: Update ut by { } L2 ut = arg max ⟨u, At−1 wt−1 − ∇ϕt−1 (ut−1 )⟩ − B (u, vt−1 ) η 2 u∈Q 5: Predict wt by { wt = arg min w∈W } L1 ⟨w, ∇ft−1 (wt−1 ) + At−1 A⊤ t−1 ut−1 ⟩ + η B1 (w, zt−1 ) 6: 7: Receive cost function ft (·) and incur loss ft (wt ) Update vt by } { L2 vt = arg max ⟨u, At wt − ∇ϕt (ut )⟩ − B (u, vt−1 ) η 2 u∈Q 8: Update zt by { zt = arg min w∈W L1 ⟨w, ∇ft (wt ) + A⊤ B1 (w, zt−1 ) t ut ⟩ + } η 9: end for max u∈Q T ∑ Ft (wt , u) − min t=1 w∈W T ∑ Ft (w, ut ) t=1 ( √ ) √ (M1 + M2 )(σ 2 + L2 ) √ ≤ 4 M1 + M2 max 2 , EGVT,p,q . α To facilitate understanding, we break the proof into several lemmas. The following lemma is by analogy with Lemma 5.16. √ (w) denote a single vector with a norm ∥θ∥ = ∥w∥2p + ∥u∥2q and Lemma 6.4. Let θ = √ u a dual norm ∥θ∥∗ = ∥w∥2p∗ + ∥u∥2q∗ . Let Φ(θ) = Φ1 (w) + Φ2 (u), B(θ, ζ) = B1 (w, u) + B2 (z, v). Then 187 ( ) ( ) ∇w Ft (wt , ut ) ⊤ wt − w η ≤ B(θ, ζt−1 ) − B(θ, ζt ) −∇u Ft (wt , ut ) ut − u ) ( + η 2 ∥∇w Ft (wt , ut ) − ∇w Ft−1 (wt−1 , ut−1 )∥2p,∗ ( ) + η 2 ∥∇u Ft (wt , ut ) − ∇u Ft−1 (wt−1 , ut−1 )∥2q,∗ ) α( 2 2 2 2 − ∥wt − zt ∥p + ∥ut − vt ∥q + ∥wt − zt−1 ∥p + ∥ut − vt−1 ∥q . 2 Proof. The updates of (wt , ut ) in Algorithm 9 can be seen as applying the updates in ( ) ( ) ( ) wt zt zt−1 Lemma 5.16 with θt = in place of w, ζt = in place of z+ , ζt−1 = in ut vt vt−1 place of z. Note that Φ(θ) is a α = min(α1 , α2 )-strongly convex function with respect to the norm ∥θ∥. Then applying the results in Lemma 5.16 we can complete the proof. 
Applying the convexity of Ft (w, u) in terms of w and the concavity of Ft (w, u) in terms of u to the result in Lemma 6.4, we have η (Ft (wt , u) − Ft (w, ut )) ≤ B1 (w, zt−1 ) − B1 (w, zt ) + B2 (u, vt−1 ) − B2 (u, vt ) ⊤ 2 + η 2 ∥∇ft (wt ) − ∇ft−1 (wt−1 ) + A⊤ t ut − At−1 ut−1 ∥p,∗ + η 2 ∥∇ϕt (ut ) − ∇ϕt−1 (ut−1 ) + At wt − At−1 wt−1 ∥2q,∗ ) α( ) α( − ∥wt − zt−1 ∥2p + ∥wt − zt ∥2p − ∥ut − vt−1 ∥2q + ∥ut − vt ∥2q 2 2 ≤ B1 (w, zt−1 ) − B1 (w, zt ) + B2 (u, vt−1 ) − B2 (u, vt ) ⊤ 2 + 2η 2 ∥∇ft (wt ) − ∇ft−1 (wt−1 )∥2p,∗ + 2η 2 ∥A⊤ t ut − At−1 ut−1 ∥p,∗ + 2η 2 ∥∇ϕt (ut ) − ∇ϕt−1 (ut−1 )∥2q,∗ + 2η 2 ∥At wt − At−1 wt−1 ∥2q,∗ ) α( ) α( − ∥wt − zt−1 ∥2p + ∥wt − zt ∥2p − ∥ut − vt−1 ∥2q + ∥ut − vt ∥2q 2 2 188 (6.8) The following lemma provides tools for proceeding the bound. Lemma 6.5. ∥∇ft (wt ) − ∇ft−1 (wt−1 )∥2p,∗ ≤ 2∥∇ft (wt ) − ∇ft−1 (wt )∥2p,∗ + 2L2 ∥wt − wt−1 ∥2p ∥∇ϕt (ut ) − ∇ϕt−1 (ut−1 )∥2q,∗ ≤ 2∥∇ϕt (ut ) − ∇ϕt−1 (ut )∥2q,∗ + 2L2 ∥ut − ut−1 ∥2q ∥At wt − At−1 wt−1 ∥2q,∗ ≤ 2R12 ∥At − At−1 ∥2p,q + 2σ 2 ∥wt − wt−1 ∥2p ⊤ 2 2 2 2 2 ∥A⊤ t ut − At−1 ut−1 ∥p,∗ ≤ 2R2 ∥At − At−1 ∥p,q + 2σ ∥ut − ut−1 ∥q Proof. We prove the first and the third inequalities. Another two inequalities can be proved similarly. ∥∇ft (wt ) − ∇ft−1 (wt−1 )∥2p,∗ ≤ 2∥∇ft (wt ) − ∇ft−1 (wt )∥2p,∗ + 2∥∇ft−1 (wt ) − ∇ft−1 (wt−1 )∥2p,∗ ≤ 2∥∇ft (wt ) − ∇ft−1 (wt )∥2p,∗ + 2L2 ∥wt − wt−1 ∥2p where we use the smoothness of ft (w). ∥At wt − At−1 wt−1 ∥2q,∗ ≤ 2∥(At − At−1 )wt ∥2q,∗ + 2∥At−1 (wt − wt−1 )∥2p ≤ 2R12 ∥At − At−1 ∥2p,q + 2σ 2 ∥wt − wt−1 ∥2p 189 Lemma 6.6. For any w ∈ W and u ∈ Q, we have η (Ft (wt , u) − Ft (w, ut )) ≤ B1 (w, zt−1 ) − B1 (w, zt ) + B2 (u, vt−1 ) − B2 (u, vt ) ( 2 + 4η ∥∇ft (wt ) − ∇ft−1 (wt )∥2p,∗ + ∥∇ϕt (ut ) − ∇ϕt−1 (ut )∥2q,∗ ) 2 2 2 + (R1 + R2 )∥At − At−1 ∥p,q + 4η 2 σ 2 ∥wt − wt−1 ∥2p + 4η 2 L2 ∥wt − wt−1 ∥2p − + 4η 2 σ 2 ∥ut − ut−1 ∥2p + 4η 2 L2 ∥ut − ut−1 ∥2q − α (∥wt − zt ∥2p + ∥wt − zt−1 ∥2p ) 2 α (∥ut − vt ∥2q + ∥ut − vt−1 ∥2q ). 2 Proof. The lemma can be proved by combining the results in Lemma 6.5 and the inequality in (6.8). η (Ft (wt , u) − Ft (w, ut )) ≤ B1 (w, zt−1 ) − B1 (w, zt ) + B2 (u, vt−1 ) − B2 (u, vt ) ⊤ 2 + 2η 2 ∥∇ft (wt ) − ∇ft−1 (wt−1 )∥2p,∗ + 2η 2 ∥A⊤ t ut − At−1 ut−1 ∥p,∗ + 2η 2 ∥∇ϕt (ut ) − ∇ϕt−1 (ut−1 )∥2q,∗ + 2η 2 ∥At wt − At−1 wt−1 ∥2q,∗ ) α( ) α( − ∥wt − zt−1 ∥2p + ∥wt − zt ∥2p − ∥ut − vt−1 ∥2q + ∥ut − vt ∥2q 2 2 ≤ B1 (w, zt−1 ) − B1 (w, zt ) + B2 (u, vt−1 ) − B2 (u, vt ) + 4η 2 ∥∇ft (wt ) − ∇ft−1 (wt )∥2p,∗ + 4η 2 L2 ∥wt − wt−1 ∥2p + 4η 2 σ 2 ∥wt − wt−1 ∥2p + 4η 2 R12 ∥At − At−1 ∥2p,q + 4η 2 R22 ∥At − At−1 ∥2p,q + 4η 2 ∥∇ϕt (ut ) − ∇ϕt−1 (ut )∥2q,∗ + 4η 2 L2 ∥ut − ut−1 ∥2q + 4η 2 σ 2 ∥ut − ut−1 ∥2q ) α( ) α( − ∥wt − zt−1 ∥2p + ∥wt − zt ∥2p − ∥ut − vt−1 ∥2q + ∥ut − vt ∥2q 2 2 190 ≤ B1 (w, zt−1 ) − B1 (w, zt ) + B2 (u, vt−1 ) − B2 (u, vt ) ( 2 + 4η ∥∇ft (wt ) − ∇ft−1 (wt )∥2p,∗ + ∥∇ϕt (ut ) − ∇ϕt−1 (ut )∥2q,∗ ) 2 2 2 + (R1 + R2 )∥At − At−1 ∥p,q + 4η 2 σ 2 ∥wt − wt−1 ∥2p + 4η 2 L2 ∥wt − wt−1 ∥2p − + 4η 2 σ 2 ∥ut − ut−1 ∥2p + 4η 2 L2 ∥ut − ut−1 ∥2q − α (∥wt − zt ∥2p + ∥wt − zt−1 ∥2p ) 2 α (∥ut − vt ∥2q + ∥ut − vt−1 ∥2q ). 2 Proof of Theorem 6.3. Taking summation of the inequalities in Lemma 6.6 over t = 1, . . . , T , √ √ applying the inequality in (5.32) twice and using η ≤ α/(4 σ 2 + L2 ), we have T ∑ t=1 Ft (wt , u) − T ∑ Ft (w, ut ) ≤ 4ηEGVT,p,q + t=1 M1 + M2 η ( √ ) √ (M1 + M2 )(σ 2 + L2 ) √ = 4 M1 + M2 max 2 , EGVT,p,q . α We complete the proof by using w∗ = arg min w∈W T ∑ Ft (wt , u) t=1 and u∗ = arg max u∈Q T ∑ Ft (w, ut ) t=1 . 
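Before turning to the corollaries of Theorem 6.3, the following sketch spells out one round of Algorithm 9 in the Euclidean case, where B1 and B2 are squared Euclidean distances and each arg-min/arg-max step reduces to a projected gradient step. The projection routines project_W and project_Q and the function names are placeholders and not part of the algorithm statement.

import numpy as np

def omp_max_structure_round(z_prev, v_prev, w_prev, u_prev,
                            grad_f_prev, A_prev, grad_phi_prev,   # data from round t-1
                            grad_f, A, grad_phi,                  # data revealed in round t
                            eta, L1, L2, project_W, project_Q):
    # Step 4: dual decision based on the previous round
    u = project_Q(v_prev + (eta / L2) * (A_prev @ w_prev - grad_phi_prev(u_prev)))
    # Step 5: primal decision (the prediction w_t)
    w = project_W(z_prev - (eta / L1) * (grad_f_prev(w_prev) + A_prev.T @ u_prev))
    # --- the cost function (f_t, A_t, phi_t) is revealed at this point ---
    # Step 7: dual searching point
    v = project_Q(v_prev + (eta / L2) * (A @ w - grad_phi(u)))
    # Step 8: primal searching point
    z = project_W(z_prev - (eta / L1) * (grad_f(w) + A.T @ u))
    return w, u, z, v

In an online implementation the last two updates would of course be executed only after the cost function of round t is received, matching steps 6–8 of Algorithm 9.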
As an immediate result of Theorem 6.3, the following Corollary states a regret bound for 191 non-smooth optimization with a fixed non-smooth component that can be written as a max structure, i.e., ft (w) = ft (w) + [g(w) = maxu∈Q ⟨Aw, u⟩ − ϕ(u)]. Corollary 6.7. Let ft (w) = ft (w) + g(w), t = 1, . . . , T be a sequence of non-smooth functions, where g(w) = maxu∈Q ⟨Aw, u⟩ − ϕ(u), and the gradual variation EGVT be defined in (5.13) w.r.t the dual norm ∥ · ∥p,∗ . Assume ft (w) are L-smooth functions w.r.t ∥ · ∥, the domain W, Q and A satisfy the boundness condition as in (6.7). If we set (√ ) √ √ √ η = min M1 + M2 /(2 EGVT ), α/(4 σ 2 + L2 ) in Algorithm 9, then we have the following regret bound T ∑ ft (wt ) − min w∈W t=1 T ∑ ft (w) t=1 ( √ ) √ (M1 + M2 )(σ 2 + L2 ) √ ≤ 4 M1 + M2 max 2 , EGVT + V(g, w1:T ), α where wT = ∑T t=1 wt /T and V(g, w1:T ) = ∑T t=1 |g(wt ) − g(wT )| measures the variation in the non-smooth component. Proof. In the case of fixed non-smooth component, the gradual variation defined in (6.8) reduces the one defined in (5.13) w.r.t the dual norm ∥ · ∥p,∗ . By using the bound in (6.9) and noting that ft (w) = maxu∈Q Ft (w, u) ≥ Ft (w, ut ), we have ) ∑ T ( T ∑ ft (wt ) + ⟨Awt , u⟩ − ϕ(u) ≤ ft (w)+ t=1 t=1 ( √ ) √ (M1 + M2 )(σ 2 + L2 ) √ , EGVT . 4 M1 + M2 max 2 α 192 Therefore ) ∑ T ( T ∑ ft (wt ) + g(wT ) ≤ ft (w)+ t=1 t=1 ( √ ) √ (M1 + M2 )(σ 2 + L2 ) √ , EGVT . 4 M1 + M2 max 2 α We complete the proof by complementing ft (wt ) with g(wt ) to obtain ft (wt ) and moving ∑ the additional term Tt=1 (g(wT ) − g(wt )) to the right hand side . Remark 6.8. Note that the regret bound in Corollary 6.7 has an additional term V (g, w1:T ) compared to the regret bound in Corollary 6.2, which constitutes a tradeoff between the reduced computational cost in solving a composite gradient mapping. To see an application of Theorem 6.3 to an online non-smooth optimization with timevarying non-smooth components, let us consider the example of online classification with hinge loss. At each trial, upon receiving an example xt , we need to make a prediction based on the current model wt , i.e., yt = ⟨wt , xt ⟩, then we receive the true label of xt denoted by yt ∈ {+1, −1}. The goal is to minimize the total number of mistakes across the time line MT = ∑T t=1 I[yt yt ≤ 0]. Here we are interested in a scenario that the data sequence (xt , yt ), t = 1, . . . , T has a small gradual variation in terms of yt xt . To obtain such a gradual variation based mistake bound, we can apply Algorithm 9. For the purpose of deriving the mistake bound, we need to make a small change to Algorithm 9. At the beginning of each trial, we first make a prediction yt = ⟨wt , xt ⟩, and if we make a mistake I[yt yt ≤ 0] the we proceed to update the auxiliary primal-dual pair (wt′ , βt ) similar to (zt , vt ) in Algorithm 9 and the primal-dual pair (wt+1 , αt+1 ) similar to (wt+1 , ut+1 ) in Algorithm 9, which are 193 given explicitly as follows: βt = ∏( ) βt−1 + η(1 − wt⊤ yt xt ) , wt′ = αt+1 = ′ (wt−1 + ηαt yt xt ) ∥w∥2 ≤R [0,1] ∏( ∏ ) ⊤ βt + η(1 − wt yt xt ) , wt+1 = ∏ (wt′ + ηαt yt xt ). ∥w∥2 ≤R [0,1] Without loss of generality, we let (xt , yt ), t = 1, . . . , MT denote the examples that are predicted incorrectly. The function Ft (·, ·) is written as Ft (w, α) = α(1 − yt ⟨w, xt ⟩). Then for a total sequence of T examples, we have the following bound by assuming ∥wt ∥2 ≤ 1 and √ η ≤ 1/2 2 MT ∑ Ft (wt , α) ≤ t=1 MT ∑ ℓ(yt wt⊤ xt ) + η MT −1 ∑ t=1 (R2 + 1)∥yt+1 xt+1 − yt xt ∥22 + t=0 R 2 + α2 . 
2η Since yt wt⊤ xt is less than 0 for the incorrectly predicted examples, if we set α = 1 in the above inequality, we have MT ≤ ≤ MT ∑ t=1 MT ∑ ℓ(yt wt⊤ xt ) + η MT −1 ∑ (R2 + 1)∥yt+1 xt+1 − yt xt ∥22 + t=0 ℓ(yt wt⊤ xt ) + R2 + 1 2η √ √ 2(R2 + 1) max(2, EGVT,2 ). t=1 which results in a gradual variational mistake bound, where √ EVGT,2 measures the gradual variation in the incorrectly predicted examples. To end the discussion, we note that one may find applications of a small gradual variation of yt xt in time series classification. For instance, if xt represent some medical measurements of a person and yt indicates whether the person observes a disease, since the health conditions usually change slowly then it is expected that 194 the gradual variation of yt xt is small. Similarly, if xt are some sensor measurements of an equipment and yt indicates whether the equipment fails or not, we would also observe a small gradual variation of the sequence yt xt during a time period. 6.3 Summary In this chapter we developed a simplified online mirror prox method using a composite gradient mapping for non-smooth optimization with a fixed non-smooth component and a primal-dual prox method for non-smooth optimization with the non-smooth component written as a max structure. Despite the impossibility result in Chapter 5 which demonstrated that smoothness of loss functions in necessary to obtain gradual variation bounds, we showed that a simplified version of online mirror prix method is able to attain regret bounded by gradual variation for loss functions with a smooth component and two types of mentioned non-smooth components. 195 Chapter 7 Mixed Optimization for Smooth Losses In this part of the thesis, we consider stochastic convex optimization problem and show that leveraging the smoothness of functions allows us to devise stochastic optimization algorithms that enjoy faster convergence rate. The focus of this chapter is on stochastic smooth optimization. The motivation for exploiting smoothness in stochastic optimization stems from the observation that the optimal √ convergence rate for stochastic optimization of smooth functions is O(1/ T ), which is same as stochastic optimization of Lipschitz continuous convex functions. This is in contrast to optimizing smooth functions using full gradients, which yields a convergence rate of O(1/T 2 ). Therefore, it is of great interest to exploit smoothness in stochastic setting as well. In particular, we are interested in designing an efficient algorithm that is in the same spirit of the stochastic gradient descent method, but can effectively leverage the smoothness of the loss function to achieve a significantly faster convergence rate. We introduce a new setup for optimizing convex functions, termed as mixed optimization, which allows to access both a stochastic oracle and a full gradient oracle to take advantages of their individual merits. Our goal is to significantly improve the convergence rate of stochastic optimization of smooth functions by having an additional small number of 196 accesses to the full gradient oracle. We show that, with an O(ln T ) calls to the full gradient oracle and an O(T ) calls to the stochastic oracle, the proposed mixed optimization algorithm is able to achieve an optimization error of O(1/T ). 
The key insight underlying the mixed optimization paradigm is that by infrequent use of full gradients at specific points we are able to progressively reduce the variance of stochastic gradients which leads to faster convergence rates. The rest of this chapter is organized as follows. In Section 7.1 we motivate the problem. Section 7.2 describes the MixedGrad algorithm, discusses the main intuition behind it, and states the main result on its convergence rate. The proof of convergence rate is given in Section 7.3 and the omitted proofs are deferred to Section 7.4. Section 7.5 concludes the paper and discusses few open questions. Finally, Section 7.6 briefly reviews the literature on deterministic and stochastic optimization. 7.1 Motivation As it has been shown in Chapter 2, many practical machine learning algorithms follow the framework of empirical risk minimization, which often can be cast into the following generic optimization problem: 1∑ min F(w) := fi (w), n w∈W n (7.1) i=1 where n is the number of training examples, fi (w) encodes the loss function related to the ith training example (xi , yi ), and W is a bounded convex domain that is introduced to regularize the solution w ∈ W (i.e., the smaller the size of W, the stronger the regularization 197 is). In this chapter, we focus on the learning problems for which the loss function fi (w) is smooth. Examples of smooth loss functions include least square with fi (w) = (yi − ⟨w, xi ⟩)2 and logistic regression with fi (w) = log (1 + exp(−yi ⟨w, xi ⟩)). Since the regularization is enforced through the restricted domain W, we did not introduce a ℓ2 regularizer λ∥w∥2 /2 into the optimization problem and as a result, we do not assume the loss function to be strongly convex. We note that a small ℓ2 regularizer does NOT improve the convergence rate of stochastic optimization. More specifically, the convergence rate for stochastically √ √ optimizing a ℓ2 regularized loss function remains as O(1/ T ) when λ = O(1/ T ) [72], a scenario that is often encountered in real-world applications. A preliminary approach for solving the optimization problem in (7.1) is the batch gradient descent (GD) algorithm [121]. It starts with some initial point, and iteratively updates the solution using the equation wt+1 = ΠW (wt − η∇F (wt )), where ΠW (·) is the orthogonal projection onto the convex domain W. It has been shown that for smooth objective functions, the convergence rate of standard GD is O(1/T ) [121], and can be improved to O(1/T 2 ) by an accelerated GD algorithm [120, 121, 123]. The main shortcoming of GD method is its high cost in computing the full gradient ∇F(wt ) when the number of training examples is large, i.e., it requires O(n) gradient computations per iteration. Stochastic gradient descent (SGD) alleviates this limitation of GD by sampling one (or a small set of) examples and computing a stochastic (sub)gradient at each iteration based on the sampled examples [29, 115, 134]. Since the computational cost of SGD per iteration is independent of the size of the data (i.e., n), it is usually appealing for large-scale learning and optimization. While SGD enjoys a high computational efficiency per iteration, it suffers from a slow convergence rate for optimizing smooth functions. 
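To make the contrast in per-iteration cost concrete, here is a minimal sketch of the two update rules for the finite-sum objective (7.1), written for a least-squares loss and a Euclidean-ball domain purely for illustration; the helper names are ours.

import numpy as np

def project_ball(w, R=10.0):
    nrm = np.linalg.norm(w)
    return w if nrm <= R else (R / nrm) * w

def full_gradient(w, X, y):
    # gradient of F(w) = (1/n) sum_i (y_i - <w, x_i>)^2, an O(n d) computation
    return -2.0 * X.T @ (y - X @ w) / len(y)

def gd_step(w, X, y, eta):
    return project_ball(w - eta * full_gradient(w, X, y))

def sgd_step(w, X, y, eta, rng):
    i = rng.integers(len(y))                       # one sampled example: O(d) per step
    g = -2.0 * (y[i] - X[i] @ w) * X[i]
    return project_ball(w - eta * g)

Each call to gd_step touches all n examples, while sgd_step touches a single one, which is precisely the trade-off between the two methods discussed above.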
It has been shown in [119] that the √ effect of the stochastic noise cannot be decreased with a better rate than O(1/ T ) which 198 is significantly worse than GD that uses the full gradients for updating the solutions and this limitation is also valid when the target function is smooth. In addition for general Lipschitz continuous convex functions, SGD exhibits the same convergence rate as that for the smooth functions, implying that smoothness of the loss function is essentially not very useful and can not be exploited in stochastic optimization. The slow convergence rate for stochastically optimizing smooth loss functions is mostly due to the variance in stochastic gradients: unlike the full gradient case where the norm of a gradient approaches to zero when the solution is approaching to the optimal solution, in stochastic optimization, the norm of a stochastic gradient is constant even when the solution is close to the optimal solution. It is √ the variance in stochastic gradients that makes the convergence rate O(1/ T ) unimprovable for stochastic smooth optimization [119, 4]. In this chapter, we are interested in designing an efficient algorithm that is in the same spirit of SGD but can effectively leverage the smoothness of the loss function to achieve a significantly faster convergence rate. To this end, we consider a new setup for optimization that allows us to interplay between stochastic and deterministic gradient descent methods. In particular, we assume that the optimization algorithm has an access to two oracles: • A stochastic oracle Os that returns the loss function fi (w) based on the sampled training example (xi , yi ) 1 , and • A full gradient oracle Of that returns the gradient ∇F(w) for any given solution w ∈ W. We refer to this new setting as mixed optimization in order to distinguish it from both stochastic and full gradient optimization models. Obviously, the challenging issue in this 1 We note that the stochastic oracle assumed in our study is slightly stronger than the stochastic gradient oracle as it returns the sampled function instead of the stochastic gradient. 199 regard is to minimize the number of full gradients to be as minimum as possible while having the same number of stochastic gradient accesses. The key question we examined in this chapter is: ”Is it possible to improve the convergence rate for stochastic optimization of smooth functions by having a small number of calls to the full gradient oracle Of ? ” In this chapter we give an affirmative answer to this question. In particular, we show that with an additional O(ln T ) accesses to the full gradient oracle Of , the proposed algorithm, referred to as MixedGrad, can improve the convergence rate for stochastic optimization of smooth functions to O(1/T ), the same rate for stochastically optimizing a strongly convex function [72, 125, 137]. The MixedGrad algorithm builds off on multi-stage methods [72] and operates in epochs, but involves novel ingredients so as to obtain an O(1/T ) rate for smooth losses. In particular, we form a sequence of strongly convex objective functions to be optimized at each epoch and decrease the amount of regularization and shrink the domain as the algorithm proceeds. The full gradient oracle Of is only called at the beginning of each epoch. 
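The following is a minimal, unconstrained Euclidean sketch of a single epoch of the scheme just outlined: one call to the full gradient oracle O_f at the anchor point, followed by stochastic steps whose search direction combines that full gradient with the sampled loss returned by O_s. The precise method, including the shrinking domain and the parameter schedules, is given as Algorithm 10 in the next section; the helper names here are ours.

import numpy as np

def mixed_epoch(w_bar, full_grad, grad_fi, n, lam, eta, T, rng):
    # the single access to the full gradient oracle O_f in this epoch
    g = lam * w_bar + full_grad(w_bar)
    d = np.zeros_like(w_bar)                 # offset from the anchor point w_bar
    d_avg = np.zeros_like(w_bar)
    for _ in range(T):
        i = rng.integers(n)                  # O_s returns the sampled loss f_i
        g_hat = g + grad_fi(i, w_bar + d) - grad_fi(i, w_bar)   # combined search direction
        d = d - eta * (g_hat + lam * d)      # step on the l2-regularized epoch objective
        d_avg += d
    return w_bar + d_avg / T                 # fold the averaged offset back into the anchor

Because each f_i is smooth, the stochastic part of this direction, ∇f_i(w̄ + d) − ∇f_i(w̄), has norm at most β∥d∥ (with β the smoothness constant) and therefore shrinks together with the offset, which is the variance-reduction mechanism the next section makes precise.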
In this chapter our focus is only smooth functions and in the coming chapters, we show that the it is further possible to develop faster optimization schemes when the functions is both smooth and strongly convex by making the number of full gradients independent of condition number. 7.2 The MixedGrad Algorithm In stochastic first-order optimization setting, instead of having direct access to F(w), we only have access to a stochastic gradient oracle, which given a solution w ∈ W, returns the gradient ∇fi (w) where i is sampled uniformly at random from {1, 2, · · · , n}. The goal 200 of stochastic optimization to use a bounded number T of oracle calls, and compute some ¯ ∈ W such that the optimization error, F(w) ¯ − F (w∗ ), is as small as possible. w In the mixed optimization model considered in this study, we first relax the stochastic oracle Os by assuming that it will return a randomly sampled loss function fi (w), instead of the gradient ∇fi (w) for a given solution w may be applied to achieve O(1/T ) convergence. Second, we assume that the learner also has an access to the full gradient oracle Of . Our goal is to significantly improve the convergence rate of stochastic gradient descent (SGD) by making a small number of calls to the full gradient oracle Of . In particular, we show that by having only O(log T ) accesses to the full gradient oracle and O(T ) accesses to the stochastic oracle, we can tolerate the noise in stochastic gradients and attain an O(1/T ) convergence rate for optimizing smooth functions. We now turn to describe the proposed mixed optimization algorithm and state its convergence rate. The detailed steps of MixedGrad algorithm are shown in Algorithm 10. It follows the epoch gradient descent algorithm proposed in [72] for stochastically minimizing strongly convex functions and divides the optimization process into m epochs, but involves novel ingredients so as to obtain an O(1/T ) convergence rate. The key idea is to introduce a ℓ2 regularizer into the objective function to make it strongly convex, and gradually reduce the amount of regularization over the epochs. We also shrink the domain as the algorithm proceeds. We note that reducing the amount of regularization over time is closely-related to the classic proximal-point algorithms. Throughout the chapter, we will use the subscript for the index of each epoch, and the superscript for the index of iterations within each epoch. Below, we describe the key idea behind the MixedGrad algorithm. ¯ k be the solution obtained before the kth epoch, which is initialized to be 0 for Let w ¯ k, the first epoch. Instead of searching for w∗ at the kth epoch, our goal is to find w∗ − w 201 Algorithm 10 MixedGrad Algorithm 1: Input: • step size η1 • domain size ∆1 • the number of iterations T1 for the first epoch • the number of epochs m • regularization parameter λ1 • shrinking parameter γ > 1 ¯1 = 0 2: Initialize: w 3: for k = 1, . . . , m do 4: Construct the domain Wk = {w : w + wk ∈ W, ∥w∥ ≤ ∆k } ¯ k) 5: Call the full gradient oracle Of for ∇F(w ∑ ¯ k + ∇F(w ¯ k ) = λk w ¯ k + n1 n ¯ k) 6: Compute gk = λk w i=1 ∇fi (w 1 7: Initialize wk = 0 8: for t = 1, . . . 
, Tk do 9: Call stochastic oracle Os to return a randomly selected loss function fit (w) k ˆkt = gk + ∇fit (wkt + w ¯ k ) − ∇fit (w ¯ k) 10: Compute the stochastic gradient as g k k 11: Update the solution by 1 ˆkt + λk wkt ⟩ + ∥w − wkt ∥2 wkt+1 = arg max ηk ⟨w − wkt , g 2 w∈W k end for 1 ∑T +1 wt and w ¯ k+1 = w ¯ k + wk+1 Set wk+1 = T +1 t=1 k 14: Set ∆k+1 = ∆k /γ, λk+1 = λk /γ, ηk+1 = ηk /γ, and Tk+1 = γ 2 Tk 15: end for ¯ m+1 Return w 12: 13: resulting in the following optimization problem for the kth epoch λk 1∑ ¯ k ∥2 + ¯ k ), fi (w + w ∥w + w 2 n n min w + wk ∈ W (7.2) i=1 ∥w∥ ≤ ∆k where ∆k specifies the domain size of w and λk is the regularization parameter introduced at the kth epoch. By introducing the ℓ2 regularizer, the objective function in (7.2) becomes strongly convex, making it possible to exploit the technique for stochastic optimization of 202 strongly convex function in order to improve the convergence rate. The domain size ∆k and the regularization parameter λk are initialized to be ∆1 > 0 and λ1 > 0, respectively, and are reduced by a constant factor γ > 1 every epoch, i.e., ∆k = ∆1 /γ k−1 and λk = λ1 /γ k−1 . ¯ k ∥2 /2 from the objective function in (7.2), we obtain By removing the constant term λk ∥w the following optimization problem for the kth epoch [ min w∈Wk ] n ∑ λk 1 ¯ k⟩ + ¯ k) , Fk (w) = ∥w∥2 + λk ⟨w, w fi (w + w 2 n (7.3) i=1 where Wk = {w : w + wk ∈ W, ∥w∥ ≤ ∆k }. We rewrite the objective function Fk (w) as n 1∑ λk 2 ¯ k⟩ + ¯ k) ∥w∥ + λk ⟨w, w fi (w + w Fk (w) = 2 n i=1 ⟨ ⟩ n n ∑ λk 1 1∑ 2 ¯k + ¯ k) + ¯ k ) − ⟨w, ∇fi (w ¯ k )⟩ = ∥w∥ + w, λk w ∇fi (w fi (w + w 2 n n i=1 i=1 n 1∑ k λk 2 ∥w∥ + ⟨w, gk ⟩ + fi (w) = 2 n (7.4) i=1 where 1∑ ¯k + ¯ k ) and fik (w) = fi (w + w ¯ k ) − ⟨w, ∇fi (w ¯ k )⟩. g k = λk w ∇fi (w n n i=1 The main reason for using fik (w) instead of fi (w) is to tolerate the variance in the stochastic gradients. To see this, from the smoothness assumption of fi (w) we obtain the following inequality for the norm of fik (w) as: ¯ k ) − ∇fi (w ¯ k )∥ ≤ β∥w∥. ∇fik (w) = ∥∇fi (w + w 203 As a result, since ∥w∥ ≤ ∆k and ∆k shrinks over epochs, then ∥w∥ will approach to zero over epochs and consequentially ∥∇fik (w)∥ approaches to zero, which allows us to effectively control the variance in stochastic gradients, a key to improving the convergence of stochastic optimization for smooth functions to O(1/T ). Using Fk (w) in (7.4), at the tth iteration of the kth epoch, we call the stochastic oracle Os to randomly select a loss function f k (w) and update the solution by following the standard it paradigm of SGD by wkt+1 ( ) t t k t = Πw∈W wk − ηk (λk wk + gk + ∇f t (wk )) k i k ( ) t t t ¯ k ) − ∇fit (w ¯ k )) , = Πw∈W wk − ηk (λk wk + gk + ∇fit (wk + w k k (7.5) k where Πw∈W (·) projects the solution w into the domain Wk that shrinks over epochs. k At the end of each epoch, we compute the average solution wk , and update the solution ¯ k to w ¯ k+1 = w ¯ k + wk . Similar to the epoch gradient descent algorithm [72], we from w increase the number of iterations by a constant γ 2 for every epoch, i.e. Tk = T1 γ 2(k−1) . In order to perform stochastic gradient updating given in (7.5), we need to compute vector gk at the beginning of the kth epoch, which requires an access to the full gradient oracle Of . It is easy to count that the number of accesses to the full gradient oracle Of is m, and the number of accesses to the stochastic oracle Os is T = T1 m ∑ γ 2(i−1) = i=1 γ 2m − 1 T1 . 
γ2 − 1 Thus, if the total number of accesses to the stochastic gradient oracle is T , the number of 204 access to the full gradient oracle required by MixedGrad algorithm is O(ln T ), consistent with our goal of making a small number of calls to the full gradient oracle. The theorem below shows that for smooth objective functions, by having O(ln T ) access to the full gradient oracle Of and O(T ) access to the stochastic oracle Os , by running MixedGrad algorithm, we achieve an optimization error of O(1/T ). Theorem 7.1. Let δ ≤ e−9/2 be the failure probability. Set γ = 2, λ1 = 16β and T1 = 300 ln m , δ η1 = 1 √ , and ∆1 = R. 2β 3T1 ( ) ¯ m+1 be the solution returned by MixedGrad method in Define T = T1 22m − 1 /3. Let w Algorithm 10 after m epochs with m = O(ln T ) calls to the full gradient oracle Of and T calls to the stochastic oracle Os . Then, with a probability 1 − 2δ, we have 80βR2 ¯ m+1 ) − min F(w) ≤ 2m−2 = O F(w 2 w∈W 7.3 ( ) β . T Analysis of Convergence Rate Now we turn to proving the main theorem. The proof will be given in a series of lemmas and theorems where the proof of few are given in Section 7.4. The proof of main theorem is based on induction. To this end, let w∗k be the optimal solution that minimizes Fk (w) defined in (7.3). The key to our analysis is show that when ∥w∗k ∥ ≤ ∆k , with a high probability, it holds that ∥w∗k+1 ∥ ≤ ∆k /γ, where w∗k+1 is the optimal solution that minimizes Fk+1 (w), as revealed by the following theorem. Theorem 7.2. Let w∗k and w∗k+1 be the optimal solutions that minimize Fk (w) and Fk+1 (w), 205 respectively, and wk+1 be the average solution obtained at the end of kth epoch of MixedGrad ( √ ) algorithm. Suppose ∥w∗k ∥ ≤ ∆k . By setting the step size ηk = 1/ 2β 3Tk , we have, with a probability 1 − 2δ, λk ∆2k ∆k k+1 ∥w∗ ∥ ≤ and Fk (wk+1 ) − min Fk (w) ≤ 4 w γ 2γ provided that δ ≤ e−9/2 and Tk ≥ 300γ 8 β 2 1 ln . δ λ2k Taking this statement as given for the moment, we proceed with the proof of Theorem 7.1, returning later to establish the claim stated in Theorem 7.2. Proof of Theorem 7.1. It is easy to check that for the first epoch, using the fact W ∈ BR , we have ∥w∗1 ∥ = ∥w∗ ∥ ≤ R := ∆1 . Let w∗m be the optimal solution that minimizes Fm (w) and let w∗m+1 be the optimal solution obtained in the last epoch. Using Theorem 7.1, with a probability 1 − 2mδ, we have ∆1 ∥w∗m ∥ ≤ m−1 , γ Fm (wm+1 ) − Fm (w∗m ) ≤ λ1 ∆21 λm ∆2m = 2γ 4 2γ 3m+1 Hence, λ1 ∆21 λ1 1∑ ¯ m⟩ ¯ m+1 ) ≤ Fm (w∗m ) + 3m+1 − m−1 ⟨wm+1 , w fi (w n 2γ γ n i=1 ≤ Fm (w∗m ) + 206 λ1 ∆21 ¯ m ∥∆ λ ∥w + 1 2m−2 1 3m+1 2γ γ where the last step uses the fact ∥w∗m+1 ∥ ≤ ∆m = ∆1 γ 1−m . Since ¯ m∥ ≤ ∥w m ∑ |wi | ≤ i=1 m ∑ ∆i ≤ i=1 γ∆1 ≤ 2∆1 γ−1 where in the last step holds under the condition γ ≥ 2. By combining above inequalities, we obtain λ1 ∆21 2λ1 ∆2 1∑ ¯ m+1 ) ≤ Fm (w∗m ) + 3m+1 + 2m−21 . fi ( w n 2γ γ n i=1 Our final goal is to relate Fm (w) to minw L(w). Since w∗m minimizes Fm (w), for any w∗ ∈ arg min L(w), we have n ) λ1 ( 1∑ 2 m ¯ m ∥ + 2⟨w∗ − w ¯ m, w ¯ m ⟩ . (7.6) fi (w∗ ) + m−1 ∥w∗ − w Fm (w∗ ) ≤ Fm (w∗ ) = n 2γ i=1 ¯ m ∥. To this end, after the first Thus, the key to bound |F(w∗m )−F (w∗ )| is to bound ∥w∗ − w ¯ m+1 , w ¯ m+2 , . . . be the sequence of m epoches, we run Algorithm 10 with f ull gradients. Let w solutions generated by Algorithm 10 after the first m epochs. For this sequence of solutions, Theorem 7.2 will hold deterministically as we deploy the full gradient for updating, i.e., ∥wk ∥ ≤ ∆k for any k ≥ m + 1. 
Since we reduce λk exponentially, λk will approach to zero ¯ k }∞ and therefore the sequence {w k=m+1 will converge to w∗ , one of the optimal solutions ¯ k }∞ ¯ k ∥ ≤ ∆k for any that minimize L(w). Since w∗ is the limit of sequence {w k=m+1 and ∥w k ≥ m + 1, we have ¯ m∥ ≤ ∥w∗ − w ∞ ∑ i=m+1 |wi | ≤ ∞ ∑ k=m+1 207 ∆1 2∆ ∆k ≤ m ≤ m1 −1 γ γ (1 − γ ) where the last step follows from the condition γ ≥ 2. Thus, λ1 1∑ Fm (w∗m ) ≤ fi (w∗ ) + m−1 n 2γ n ( i=1 4∆21 8∆21 + m γ γ 2m ) n n ) 1∑ 2λ1 ∆21 ( 5λ1 ∆2 1∑ −m = fi (w∗ ) + 2m−1 2 + γ ≤ fi (w∗ ) + 2m−11 n n γ γ i=1 (7.7) i=1 By combining the bounds in (7.6) and (7.7), we have, with a probability 1 − 2mδ, 5λ1 ∆2 1∑ 1∑ ¯ m+1 ) − fi (w fi (w∗ ) ≤ 2m−21 = O(1/T ) n n γ n n i=1 i=1 where T = T1 m−1 ∑ k=0 ( ) T1 γ 2m − 1 2k γ = γ2 − 1 T ≤ 1 γ 2m . 3 We complete the proof by plugging in the stated values for γ, λ1 and ∆1 . We turn now to proving the Theorem 7.2. For the convenience of discussion, we drop the subscript k for epoch just to simplify our notation. Let λ = λk , T = Tk , ∆ = ∆k , g = gk . ¯ =w ¯ k be the solution obtained before the start of the epoch k, and let w ¯′ = w ¯ k+1 be Let w the solution obtained after running through the kth epoch. We denote by F(w) and F ′ (w) the objective functions Fk (w) and Fk+1 (w). They are given by λ 1∑ ¯ + ¯ F(w) = ∥w∥2 + λ⟨w, w⟩ fi (w + w) 2 n n (7.8) i=1 F ′ (w) = λ 1∑ λ ¯ ′⟩ + ¯ ′) ∥w∥2 + ⟨w, w fi (w + w 2γ γ n n (7.9) i=1 Let w∗ = w∗k and w∗′ = w∗k+1 be the optimal solutions that minimize F(w) and F ′ (w) over the domain Wk and Wk+1 , respectively. Under the assumption that ∥w∗ ∥ ≤ ∆, our goal is 208 to show ∥w∗′ ∥ ≤ ∆ , γ ¯ ′ ) − F(w∗ ) ≤ F(w λ∆2 2γ 4 The following lemma bounds F(wt ) − F(w∗ ) where the proof is deferred to Section 7.4. Lemma 7.3. ∥wt − w∗ ∥2 ∥wt+1 − w∗ ∥2 − 2η 2η 2 η + ∇fit (wt ) + λwt + ⟨g, wt − wt+1 ⟩ 2 ⟨ ⟩ + ∇F(w∗ ) − ∇fit (w∗ ), wt − w∗ ⟨ ⟩ + −∇fit (wt ) + ∇fit (w∗ ) − ∇F(w∗ ) + ∇F(wt ), wt − w∗ F(wt ) − F (w∗ ) ≤ ¯ 1 = 0, we By adding the inequality in Lemma 7.7 over all iterations, using the fact w have T ∑ F(wt ) − F(w∗ ) ≤ t=1 + ∥w∗ ∥2 ∥wT +1 − w∗ ∥2 − − ⟨g, wT +1 ⟩ 2η 2η ∑ η∑ ⟨∇F(w∗ ) − ∇fit (w∗ ), wt − w∗ ⟩ ∥∇fit (wt ) + λwt ∥2 + 2 T T t=1 t=1 ≜AT + T ⟨ ∑ ≜BT ⟩ −∇fit (wt ) + ∇fit (w∗ ) − ∇F(w∗ ) + ∇F(wt ), wt − w∗ . t=1 ≜CT Since g = ∇F (0) and F(wT +1 ) − F(0) ≤ ⟨∇F (0), wT +1 ⟩ + β β ∥wT +1 ∥2 = ⟨g, wT +1 ⟩ + ∥wT +1 ∥2 2 2 209 using the fact F(0) ≤ F (w∗ ) + β2 ∥w∗ ∥2 and max(∥w∗ ∥, ∥wT +1 ∥) ≤ ∆, we have −⟨g, wT +1 ⟩ ≤ F(0) − F(wT +1 ) + β 2 ∆ ≤ β∆2 − (F(wT +1 ) − F(w∗ )) 2 and therefore T∑ +1 ( F(wt ) − F(w∗ ) ≤ ∆2 t=1 1 +β 2η ) η + AT + BT + CT . 2 (7.10) The following lemmas bound AT , BT and CT . Lemma 7.4. For AT defined above we have AT ≤ 6β 2 ∆2 T . The following lemma upper bounds BT and CT . The proof is based on the Bernstein’s inequality for martingales and is given later . Lemma 7.5. With a probability 1 − 2δ, we have ( ) ( ) √ √ 1 1 1 1 BT ≤ β∆2 ln + 2T ln and CT ≤ 2β∆2 ln + 2T ln . δ δ δ δ Using Lemmas 7.8 and 7.9, by substituting the uppers bounds for AT , BT , and CT in (7.10), with a probability 1 − 2δ, we obtain ( ) √ T∑ +1 1 1 1 F(wt ) − F(w∗ ) ≤ ∆2 + β + 6β 2 ηT + 3β ln + 3β 2T ln 2η δ δ t=1 √ By choosing η = 1/[2β 3T ], we have ( ) √ T∑ +1 √ 1 1 F(wt ) − F(w∗ ) ≤ ∆2 2β 3T + β + 3β ln + 3β 2T ln δ δ t=1 and using the fact w = ∑T +1 i=1 wt /(T + 1), we have √ √ 5β 3 ln[1/δ] 5β 3 ln[1/δ] √ F(w) − F (w∗ ) ≤ ∆2 √ , and ∆2 = ∥w − w∗ ∥2 ≤ ∆2 . 
T +1 λ T +1 210 Thus, when T ≥ [300γ 8 β 2 ln 1δ ]/λ2 , we have, with a probability 1 − 2δ, ∆2 ≤ ∆2 λ , and |F(w) − F(w∗ )| ≤ 4 ∆2 . 4 γ 2γ (7.11) The next lemma relates ∥w∗′ ∥ to ∥w − w∗ ∥. Lemma 7.6. We have ∥w∗′ ∥ ≤ γ∥w − w∗ ∥. Combining the bound in (7.11) with Lemma 7.10, we have ∥w∗′ ∥ ≤ ∆/γ. 7.4 7.4.1 Proofs of Convergence Rate Proof of Lemma 7.7 Before proving the lemmas we recall the definition of F(w), F ′ (w), g, and fi (w) as: λ 1∑ ¯ + ¯ ∥w∥2 + λ⟨w, w⟩ fi (w + w), 2 n n F(w) = i=1 F ′ (w) = λ 1∑ λ ¯ ′⟩ + ¯ ′ ), ∥w∥2 + ⟨w, w fi (w + w 2γ γ n n i=1 ¯+ g = λw 1 n n ∑ ¯ ∇fi (w), i=1 ¯ − ⟨w, ∇fi (w)⟩. ¯ fi (w) = fi (w + w) We also recall that w∗ and w∗′ are the optimal solutions that minimize F(w) and F ′ (w) over the domain Wk and Wk+1 , respectively. 211 Lemma 7.7. ∥wt+1 −w∗ ∥2 ∥wt −w∗ ∥2 − + η2 2η 2η F(wt ) − F(w∗ ) ≤ + ⟨ + ∇fit (wt ) + λwt 2 + ⟨g, wt − wt+1 ⟩ ⟩ ⟨ ∇fit (w∗ ) − ∇F(w∗ ), wt − w∗ −∇fit (wt ) + ∇fit (w∗ ) − ∇F(w∗ ) + ∇F(wt ), wt − w∗ ⟩ Proof. For each iteration t in the kth epoch, from the strong convexity of F(w) we have λ F(wt ) − F(w∗ ) ≤ ⟨∇F(wt ), wt − w∗ ⟩ − ∥wt − w∗ ∥2 2 ⟨ ⟩ λ = ⟨g + ∇fit (wt ) + λwt , wt − w∗ ⟩ + −∇fit (wt ) + ∇F(wt ), wt − w∗ − ∥wt − w∗ ∥2 , 2 where F(w) = n1 ∑n i=1 fi (w). We now try to upper bound the first term in the right hand side. Since ⟨g + ∇fit (wt ) + λwt , wt − w∗ ⟩ ∥wt − w∗ ∥2 ∥wt − w∗ ∥2 + 2η 2η 2 ∥wt+1 − w∗ ∥2 ∥wt − w∗ ∥2 ∥wt − wt+1 ∥ − + ⟨g + ∇fit (wt ) + λwt , wt − wt+1 ⟩ − 2η 2η 2η 2 2 ∥wt+1 − w∗ ∥ ∥wt − w∗ ∥ ⟨g, wt − wt+1 ⟩ − + 2η 2η [ ] ∥wt − w∥2 max ⟨∇fit (wt ) + λwt , wt − w⟩ − w 2η 2 ∥wt − w∗ ∥2 η ∥wt+1 − w∗ ∥ + + ∥∇fit (wt ) + λwt ∥2 ⟨g, wt − wt+1 ⟩ − 2η 2η 2 = ⟨g + ∇fit (wt ) + λwt , wt − w∗ ⟩ − ≤ ≤ + = where the first inequality follows from the fact that wt+1 in the minimizer of the following 212 optimization problem: wt+1 = arg min ¯ w∈W∩∥w−w∥≤∆ ⟨g + ∇fit (wt ) + λwt , w − wt ⟩ + ∥w − wt ∥2 . 2η Therefore, we obtain F(wt ) − F (w∗ ) ≤ ∥wt − w∗ ∥2 ∥wt+1 − w∗ ∥2 λ − − ∥wt − w∗ ∥2 2η 2η 2 ⟩ 2 ⟨ η ∇fit (wt ) + λwt + ∇F(w∗ ) − ∇fit (w∗ ), wt − w∗ +⟨g, wt − wt+1 ⟩ + 2 ⟨ ⟩ + −∇fit (wt ) + ∇fit (w∗ ) − ∇F(w∗ ) + ∇F(wt ), wt − w∗ , as desired. We now turn to prove the upper bound on AT . Lemma 7.8. AT ≤ 6β 2 ∆2 T Proof. We bound AT as AT = ≤ ≤ T ∑ t=1 T ∑ t=1 T ∑ ∥∇fit (wt ) + λwt ∥2 2∥∇fit (wt )∥2 + 2λ2 ∥wt ∥2 2λ2 ∆2 + 2∥∇fit (wt ) − ∇fit (w∗ ) + ∇fit (w∗ )∥2 t=1 ≤ 6β 2 ∆2 T 213 where the second inequality follows (a + b)2 ≤ 2(a2 + b2 ) and the last inequality follows from the smoothness assumption. Lemma 7.9. With a probability 1 − 2δ, we have ( BT ≤ β∆2 1 ln + δ √ 1 2T ln δ ) ( and CT ≤ 2β∆2 1 ln + δ √ 1 2T ln δ ) The proof is based on the Berstein inequality for martingales stated in Theorem A.25. Equipped with this concentration inequality, we are now in a position to upper bound BT and CT as follows. Proof of Lemma 7.9. Denote Xt = ⟨∇fit (w∗ ) − ∇F(w∗ ), wt − w∗ ⟩. We have that the conditional expectation of Xt , given randomness in previous rounds, is Et−1 [Xt ] = 0. We now apply Theorem A.25 to the sum of martingale differences. In particular, we have, with a probability 1 − e−t , √ √ 2 BT ≤ Kt + 2Σt 3 where K = Σ = max ⟨∇fit (w∗ ) − ∇F(w∗ ), wt − w∗ ⟩ ≤ 2β∆2 1≤t≤T T ∑ [ ] Et |⟨∇fit (w∗ ) − ∇F(w∗ ), wt − w∗ ⟩|2 ≤ β 2 ∆4 T t=1 Hence, with a probability 1 − δ, we have ( 1 BT ≤ β∆2 ln + δ 214 √ 1 2T ln δ ) Similar, for CT , we have, with a probability 1 − δ, ( CT ≤ 2β∆2 1 ln + δ √ 1 2T ln δ ) Lemma 7.10. ∥w∗′ ∥ ≤ γ∥w − w∗ ∥. Proof. 
We rewrite F(w) as λ 1∑ ¯ + ¯ F(w) = ∥w∥2 + λ⟨w, w⟩ fi (w + w) 2 n n i=1 λ 1∑ ¯ + ¯ ′) = ∥w − w + w∥2 + λ⟨w − w + w, w⟩ fi (w − w + w 2 n n i=1 Define z = w − w. We have 1∑ λ ¯ + λ⟨w, w⟩ ¯ + ¯ ′) ∥z + w∥2 + λ⟨z, w⟩ fi (z + w F(w) = 2 n n i=1 n λ 1∑ λ ¯ ′⟩ + ¯ ′ ) + ∥w∥2 + λ⟨w, w⟩ ¯ = ∥z∥2 + λ⟨z, w fi (z + w 2 n 2 i=1 λ ¯ = F(z) + ∥w∥2 + λ⟨w, w⟩ 2 where λ 1∑ ¯ ′⟩ + ¯ ′) F(z) = ∥z∥2 + λ⟨z, w fi (z + w 2 n n i=1 Define w∗ = w∗ − w. Evidently, w∗ minimizes F(w). The only difference between F(w) and F ′ (w) is that they use different modulus of strong convexity λ. Thus, following [153], 215 we have ∥w∗ − w∗′ ∥ ≤ 1 − γ −1 ∥w∗ ∥ ≤ (γ − 1)∥w∗ ∥ γ −1 Hence, ∥w∗′ ∥ ≤ γ∥w∗ ∥ = γ∥w∗ − w∥ which completes the proofs. 7.5 Summary We presented a new paradigm for optimization, termed as mixed optimization, that aims to improve the convergence rate of stochastic optimization by making a small number of calls to the full gradient oracle. We proposed the MixedGrad algorithm and showed that it is able to achieve an O(1/T ) convergence rate by accessing stochastic and full gradient oracles for O(T ) and O(log T ) times, respectively. We showed that the MixedGrad algorithm is able to exploit the smoothness of the function, which is believed to be not very useful in stochastic optimization. The key insight behind the MixedGrad algorithm is to use infrequent full gradients to progressively reduce the variance of stochastic gradients as the optimization proceeds. There are few directions that are worthy of investigation. First, it would be interesting to examine the optimality of our algorithm, namely if it is possible to achieve a better convergence rate for stochastic optimization of smooth functions using O(ln T ) accesses to the full gradient oracle. Furthermore, to alleviate the computational cost caused by O(log T ) accesses to the full gradient oracle, it would be interesting to empirically evaluate the proposed algorithm in a distributed framework by distributing the individual functions among 216 processors to parallelize the full gradient computation at the beginning of each epoch which requires O(log T ) communications between the processors in total. Lastly, it is very interesting to check whether an O(1/T 2 ) rate could be achieved by an accelerated method in the mixed optimization scenario, and whether linear convergence rates could be achieved in the strongly-convex case. 7.6 Bibliographic Notes Deterministic Smooth Optimization. The convergence rate of gradient based methods usually depends on the analytical properties of the objective function to be optimized. When the objective function is strongly convex and smooth, it is well known that a simple GD method can achieve a linear convergence rate [33]. For a non-smooth Lipschitz-continuous √ function, the optimal rate for the first order method is only O(1/ T ) [121]. Although √ O(1/ T ) rate is not improvable in general, several recent studies are able to improve this rate to O(1/T ) by exploiting the special structure of the objective function [123, 122]. In the full gradient based convex optimization, smoothness is a highly desirable property. It has been shown that a simple GD achieves a convergence rate of O(1/T ) when the objective function is smooth, which is further can be improved to O(1/T 2 ) by using the accelerated gradient methods [120, 123, 121]. Stochastic Smooth Optimization. Unlike the optimization methods based on full gradients, the smoothness assumption was not exploited by most stochastic optimization methods. 
√ In fact, it was shown in [119] that the O(1/ T ) convergence rate for stochastic optimization cannot be improved even when the objective function is smooth. This classical result is further confirmed by the recent studies of composite bounds for the first order optimization methods [17, 96]. The smoothness of the objective function is exploited extensively in mini- 217 batch stochastic optimization [44, 47], where the goal is not to improve the convergence rate but to reduce the variance in stochastic gradients and consequentially the number of times for updating the solutions [154]. We finally note that the smoothness assumption coupled with the strong convexity of function is beneficial in stochastic setting and yields a geometric convergence in expectation using Stochastic Average Gradient (SAG) and Stochastic Dual Coordinate Ascent (SDCA) algorithms proposed in [128] and [135], respectively. Finally, we would like to distinguish mixed optimization from hybrid methods that use growing sample-sizes as optimization method proceeds to gradually transform the iterates into the full gradient method [60], which makes the iterations to be dependent to the sample size n as opposed to SGD. In contrast, MixedGrad is as an alternation of deterministic and stochastic gradient steps, with different of frequencies for each type of steps. Our result for mixed optimization is useful for the scenario when the full gradient of the objective function can be computed relatively efficient although it is still significantly more expensive than computing a stochastic gradient. An example of such a scenario is distributed computing where the computation of full gradients can be speeded up by having it run in parallel on many machines with each machine containing a relatively small subset of the entire training data. Of course, the latency due to the communication between machines will result in an additional cost for computing the full gradient in a distributed fashion. 218 Chapter 8 Mixed Optimization for Smooth and Strongly Convex Losses In the preceding chapter, we presented a new paradigm for stochastic optimization that allowed us to leverage the smoothness of objective function to devise faster algorithms. In this chapter of thesis, we continue our study of efficient optimization algorithms in mixed optimization regime and show that we may leverage the smoothness assumption of loss functions to devise algorithms with iteration complexities that are independent of condition number in accessing the full gradient oracle. To motivate the setting considered in this chapter, consider the optimization of smooth and strongly convex functions where the optimal itera√ tion complexity of the gradient-based algorithm is O( κ log 1/ϵ), where κ is the condition number of the objective function to be optimized. In the case that the optimization problem is ill-conditioned, we need to evaluate a larger number of full gradients, which could be computationally expensive despite the linear convergence rate of the algorithm in terms of the target accuracy ϵ. In this chapter, we propose to reduce the number of full gradients required by allowing the algorithm to access the stochastic gradients of the objective function. To this end, we present an algorithm named Epoch Mixed Gradient Descent (EMGD) that is able to utilize two kinds of gradients similar to Chapter 7. 
As same as MixedGrad a distinctive step in 219 EMGD is the mixed gradient descent, where we use a combination of the full gradient and the stochastic gradient to update the intermediate solutions. By performing a fixed number of mixed gradient descents, we are able to improve the sub-optimality of the solution by a constant factor, and thus achieve a linear convergence rate. Theoretical analysis shows that EMGD is able to find an ϵ-optimal solution by computing O(log 1/ϵ) full gradients and O(κ2 log 1/ϵ) stochastic gradients. We also provide experimental evidence complementing our theoretical results for classification problem on few medium-sized data sets. 8.1 Introduction The optimal iteration complexities for some popular optimization methods considering different combinations of characteristics of the objective function are shown in Table 8.1. We observe that when the objective function is smooth (and strongly convex), the convergence rate for full gradient descent is much faster than that for stochastic gradient descent. On the other hand, the evaluation of a stochastic gradient is usually significantly more efficient than that for a full gradient. Thus, replacing full gradients with stochastic gradients essentially trades the number of iterations with a low computational cost per iteration. In this chapter, we consider the case when the objective function is both smooth and √ strongly convex, where the optimal iteration complexity is O( κ log 1ϵ ) if the optimization method is first order and has access to the full gradients. For the optimization problems that are ill-conditioned, the condition number κ can be very large, leading to many evaluations of full gradients, an operation that is computationally expensive for large data sets. To reduce the computational cost, we are interested in the possibility of making the number of √ full gradients required independent from κ. Although the O( κ log 1ϵ ) rate is in general not 220 Lipschitz continuous ( ) Full Gradient Stochastic Gradient Smooth ( ) O √Lϵ ( ) O 12 1 ( ϵ2 ) O 12 ϵ O ϵ Smooth & Strongly Convex O (√ ) κ log 1ϵ ( ) 1 O λϵ Table 8.1: The optimal iteration complexity of convex optimization. L and λ are the moduli of smoothness and strongly convexity, respectively. κ = L/λ is the conditional number. improvable for any first order method, we bypass this difficulty by allowing the algorithm to have access to both full and stochastic gradients. Our objective is to reduce the iteration √ complexity from O( κ log 1ϵ ) to O(log 1ϵ ) by replacing most of the evaluations of full gradients with the evaluations of stochastic gradients. Under the assumption that stochastic gradients can be computed efficiently, this tradeoff could lead to a significant improvement in computational efficiency. We propose an efficient algorithm, dubbed Epoch Mixed Gradient Descent (EMGD), which fits the mixed optimization regime introduced in Chapter 7. The proposed EMGD algorithm divides the optimization process into a sequence of different epochs, an idea that is borrowed from the epoch gradient descent [72]. In each epoch, the proposed algorithm performs mixed gradient descent by evaluating one full gradient and O(κ2 ) stochastic gradients. It achieves a constant reduction in the optimization error for every epoch, leading to a linear convergence rate. Our analysis shows that EMGD is able to find an ϵ-optimal solution by computing O(log 1ϵ ) full gradients and O(κ2 log 1ϵ ) stochastic gradients. 
In other words, with the help of stochastic gradients, the number of full gradients required is reduced √ from O( κ log 1ϵ ) to O(log 1ϵ ), independent from the condition number. 221 8.2 The Epoch Mixed Gradient Descent Algorithm First, we recall from Chapter 2 that we wish to solve the following optimization problem ∫ min F(w) for F(w) = E[f (w, ξ)] = w∈W f (w, ξ)dP (ξ), (8.1) Ξ where W is a convex domain, and f (w, ξ) is a convex function with respect to the first argument. An special setting which is more appropriate for machine learning tasks is the case when the objective function can be written as a sum of finite number of convex functions, i.e., 1∑ F(w) = f (w, ξi ). n n (8.2) i=1 For learning problems such as classification and regression, each individual function in the summand can be considered as the prediction loss on the ith training example ξi = (xi , yi ) for a fixed loss function, i.e., f (w, ξi ) = ℓ(w; (xi , yi )). For simplicity of exposition, we absorb the randomness in the individual functions and use fi (w) to denote the loss on random sample ξi instead of f (w, ξi ). We note that although the formulation in (8.2) seems attractive from a practical point of view, but the proposed algorithm is general enough to solve any stochastic optimization problem formulated in (8.1). As a result, in the remainder of this chapter we base the randomness on sampling the individual functions according to the unknown distribution defined over functions. Similar to the setting introduced in Chapter 7, we assume there exist two oracles. 1. The first one is a gradient oracle Og , which for a given input point w returns the 222 gradient ∇F (w), that is, Og (w) = ∇F(w). 2. The second one is a function oracle Of , each call of which returns a random function f (w), such that F(w) = Ef [f (w)], ∀w ∈ W, and f (w) has Lipschitz continuous gradients with constant L, that is, ∥∇f (w) − ∇f (w′ )∥ ≤ L∥w − w′ ∥, ∀w, w′ ∈ W. (8.3) Although we do not define a stochastic gradient oracle directly, the function oracle Of allows us to evaluate the stochastic gradient of F(w) at any point w ∈ W. Notice that the assumption about the function oracle Of implies that the objective function F(·) is also L-smooth. To see this, since ∇F(w) = Ef [f (w)], by Jensen’s inequality we have: ∥∇F (w) − ∇F(w′ )∥ ≤ Ef ∥∇f (w) − ∇f (w′ )∥ ≤ L∥w − w′ ∥, ∀w, w′ ∈ W. (8.4) Besides, we further assume F(·) is λ-strongly convex, that is, ∥∇F(w) − ∇F(w′ )∥ ≥ λ∥w − w′ ∥, ∀w, w′ ∈ W. (8.5) From (8.4) and (8.5) it is straightforward to see that L ≥ λ. The condition number κ is defined as the ratio between these two parameters, i.e., κ = L/λ ≥ 1. The detailed steps of the proposed Epoch Mixed Gradient Descent (EMGD) are shown 223 in Algorithm 11, where we use the superscript for the index of epochs, and the subscript for the index of iterations in each epoch. Similar to the MixedGrad algorithm, we divided the optimization process into a sequence of epochs (step 4 to step 11). While in the MixedGrad algorithm the size of epochs increases exponentially and the full gradient oracle is called at the beginning of each epoch, the size of epochs and the number of access to the two types of oracles in EMGD is fixed. At the beginning of each epoch, we initialize the solution w1k to be the average solution ¯ k obtained from the last epoch, and then call the gradient oracle Og to obtain ∇F (w ¯ k ). 
w At each iteration t of epoch k, we call the function oracle Of to obtain a random function ftk (w) and define the mixed gradient at the current solution wtk as ˜tk = ∇F (w ¯ k ) + ∇ftk (wtk ) − ∇ftk (w ¯ k ), g which involves both the full gradient and the stochastic gradient. The mixed gradient can be ¯ k ) and the stochastic part ∇ftk (wtk ) − divided into two parts: the deterministic part ∇F(w ¯ k ). Due to the smoothness property of ftk (·), the norm of the stochastic part is well ∇ftk (w bounded, which facilitates the convergence analysis. Based on the mixed gradient, we update wtk by a gradient mapping over a shrinking ¯ k ∥ ≤ ∆k ) in step 9. Since the updating is similar to the standard domain (i.e., W ∩ ∥w − w gradient descent except for the domain constraint, we refer to it as mixed gradient descent for short. At the end of the iterations for epoch k, we compute the average value of T + 1 √ solutions, instead of T solutions, and update the domain size by reducing a factor of 2. The following theorem shows the convergence rate of the proposed algorithm. 224 Algorithm 11 Epoch Mixed Gradient Descent (EMGD) Algorithm 1: Input: • step size η • the initial domain size ∆1 • the number of iterations T per epoch • the number of epochs m ¯1 = 0 2: Initialize: w 3: for k = 1, . . . , m do ¯k 4: Set w1k = w ¯ k) 5: Call the gradient oracle Og to obtain ∇F(w 6: for t = 1, . . . , T do 7: Call the function oracle Of to obtain a random function ftk (·) 8: Compute the mixed gradient as ¯ k) ¯ k ) + ∇ftk (wtk ) − ∇ftk (w ˜tk = ∇F(w g 9: Update the solution by k = arg wt+1 1 ˜tk ⟩ + ∥w − wtk ∥2 η⟨w − wtk , g 2 ¯ k ∥≤∆k w∈W∩∥w−w min 10: end for √ 1 ∑T +1 wk and ∆k+1 = ∆k / 2 ¯ k+1 = T +1 11: Set w t t=1 12: end for ¯ m+1 Return w Theorem 8.1. Assume 1152L2 1 δ ≤ e−1/2 , T ≥ ln , and ∆1 ≥ max δ λ2 (√ ) 2 (F(0) − F(w∗ )), ∥w∗ ∥ . λ (8.6) √ ¯ m+1 be the solution returned by Algorithm 11 after m epoches that Set η = 1/[L T ]. Let w has m access to oracle Og and mT access to oracle Of . Then, with a probability at least 1 − mδ, we have λ[∆1 ]2 [∆1 ]2 m+1 m+1 2 ¯ ¯ F(w ) − F(w∗ ) ≤ m+1 , and ∥w − w∗ ∥ ≤ m . 2 2 225 Theorem 8.1 immediately implies that EMGD is able to achieve an ϵ optimization error by computing O(log 1ϵ ) full gradients and O(κ2 log 1ϵ ) stochastic gradients. 8.3 Analysis of Convergence Rate The proof of Theorem 8.1 is based on induction. From the assumption about ∆1 in (8.6), we have ¯ 1 ) − F(w∗ ) ≤ F(w λ[∆1 ]2 ¯ 1 − w∗ ∥2 ≤ [∆1 ]2 , , and ∥w 2 which means Theorem 8.1 is true for m = 0. Suppose Theorem 8.1 is true for m = k. That is, with a probability at least 1 − kδ, we have ¯ k+1 ) − F(w∗ ) ≤ F(w 12 λ[∆1 ]2 k+1 − w ∥2 ≤ [∆ ] . ¯ , and ∥ w ∗ 2k+1 2k Our goal is to show that after running the k+1-th epoch, with a probability at least 1 − (k + 1)δ, we have ¯ k+2 ) − F(w∗ ) ≤ F(w 12 λ[∆1 ]2 k+2 − w ∥2 ≤ [∆ ] . ¯ , and ∥ w ∗ 2k+2 2k+1 ¯ be the solution For the simplicity of presentation, we drop the index k for epoch. Let w obtained from the epoch k. Given the condition ¯ − F(w∗ ) ≤ F(w) λ 2 ¯ − w∗ ∥2 ≤ ∆2 , ∆ , and ∥w 2 226 (8.7) we will show that that after running the T iterations in one epoch, the new solution, denoted by w, satisfies F(w) − F(w∗ ) ≤ λ 2 1 ¯ 2 ≤ ∆2 , ∆ , and ∥w − w∥ 4 2 (8.8) with a probability at least 1 − δ. In the proof, we frequently use the following property of strongly convex function (see Appendix ??). Lemma 8.2. Let F(w) be a λ-strongly convex function over the domain W, and w∗ = arg minw∈W F(w). Then, for any w ∈ W, we have F(w) − F(w∗ ) ≥ λ ∥w − w∗ ∥2 . 
2 (8.9) Define ¯ F(w) = F(w) − ⟨w, g⟩, and gt (w) = ft (w) − ⟨w, ∇ft (w)⟩. ¯ g = ∇F(w), (8.10) The objective function can be rewritten as F(w) = ⟨w, g⟩ + F(w). And the mixed gradient can be rewritten as ˜ k = g + ∇gt (wt ). g 227 (8.11) Then, the updating rule given in Algorithm 11 becomes wt+1 = arg 1 η⟨w − wt , g + ∇gt (wt )⟩ + ∥w − wt ∥2 . 2 ¯ w∈W∩∥w−w∥≤∆ min (8.12) For each iteration t in the current epoch, we have F(wt ) − F(w∗ ) (8.5) λ (8.13) ≤ ⟨∇F(wt ), wt − w∗ ⟩ − ∥wt − w∗ ∥2 2 ⟨ ⟩ λ (8.11) = ⟨g + ∇gt (wt ), wt − w∗ ⟩ + ∇F(wt ) − ∇gt (wt ), wt − w∗ − ∥wt − w∗ ∥2 , 2 and ⟨g + ∇gt (wt ), wt − w∗ ⟩ =⟨g + ∇gt (wt ), wt − w∗ ⟩ − ∥wt − w∗ ∥2 ∥wt − w∗ ∥2 + 2η 2η ∥wt − wt+1 ∥2 ∥wt+1 − w∗ ∥2 ∥wt − w∗ ∥2 − + 2η 2η 2η 2 2 ∥wt − w∗ ∥ ∥wt+1 − w∗ ∥ ≤⟨g, wt − wt+1 ⟩ + − 2η 2η ( ) ∥wt − w∥2 + max ⟨∇gt (wt ), wt − w⟩ − w 2η (8.12), (8.13) ≤ ⟨g + ∇gt (wt ), wt − wt+1 ⟩ − =⟨g, wt − wt+1 ⟩ + ∥wt − w∗ ∥2 ∥wt+1 − w∗ ∥2 η − + ∥∇gt (wt )∥2 . 2η 2η 2 (8.14) Combining (8.13) and (8.14), we have F(wt ) − F(w∗ ) ≤ ∥wt − w∗ ∥2 ∥wt+1 − w∗ ∥2 λ − − ∥wt − w∗ ∥2 2η 2η 2 ⟨ ⟩ η 2 + ⟨g, wt − wt+1 ⟩ + ∥∇gt (wt )∥ + ∇F(wt ) − ∇gt (wt ), wt − w∗ . 2 228 By adding the inequalities of all iterations, we have T ∑ F(wt ) − F(w∗ ) t=1 ¯ − w∗ ∥2 ∥wT +1 − w∗ ∥2 λ ∑ ∥w ¯ − wT +1 ⟩ ≤ − − ∥wt − w∗ ∥2 + ⟨g, w 2η 2η 2 T t=1 + (8.15) ∑ η∑ ∥∇gt (wt )∥2 + ⟨∇F(wt ) − ∇gt (wt ), wt − w∗ ⟩ . 2 T T t=1 t=1 ≜AT ≜BT Since F(·) is L-smooth, we have ¯ ≤ ⟨∇F (w), ¯ wT +1 − w⟩ ¯ + F(wT +1 ) − F(w) L ¯ − wT +1 ∥2 , ∥w 2 which implies ¯ − wT +1 ⟩ ⟨g, w L 2 ∆ 2 (8.7) λ L ≤ F(w∗ ) − F(wT +1 ) + ∆2 + ∆2 2 2 ¯ − F (wT +1 ) + ≤F(w) (8.16) ≤F(w∗ ) − F(wT +1 ) + L∆2 . From (8.15) and (8.16), we have T∑ +1 ( F(wt ) − F(w∗ ) ≤ ∆2 t=1 ) 1 η + L + AT + BT . 2η 2 (8.17) Next, we consider how to bound AT and BT . The upper bound of AT is given by AT = T ∑ t=1 ∥∇gt (wt )∥2 = T ∑ T (8.3) ∑ 2 2 ¯ ¯ 2 ≤ T L2 ∆2 . (8.18) ∥∇ft (wt )−∇ft (w)∥ ≤ L ∥wt − w∥ t=1 t=1 229 To bound BT , we need the Hoeffding-Azuma inequality which is stated in Theorem A.20 for completeness. Define Vt = ⟨∇F(wt ) − ∇gt (wt ), wt − w∗ ⟩, t = 1, . . . , T. Recall the definition of F(w) and gt (w) in (8.10). Based on our assumption about the function oracle Of , it is straightforward to check that V1 , . . . is a martingale difference with respect to g1 , . . .. The value of Vt can be bounded by |Vt | ≤ ≤ (8.3), (8.4) ≤ ∇F(wt ) − ∇gt (wt ) ∥wt − w∗ ∥ ¯ + ∥∇ft (wt ) − ∇ft (w)∥) ¯ 2∆ (∥∇F(wt ) − ∇F(w)∥ ¯ ≤ 4L∆2 . 4L∆∥wt − w∥ Following Theorem A.20, with a probability at least 1 − δ, we have √ BT ≤ 4L∆2 1 2T ln . δ (8.19) By adding the inequalities in (8.17), (8.18) and (8.19) together, with a probability at least 1 − δ, we have T∑ +1 t=1 ( F(wt ) − F (w∗ ) ≤ ∆2 ) √ ηT L2 1 1 +L+ + 4L 2T ln . 2η 2 δ 230 √ By choosing η = 1/[L T ], we have T∑ +1 t=1 ) ( √ √ √ 1 1 F(wt ) − F(w∗ ) ≤ L∆2 T + 1 + 4 2T log ≤ 6L∆2 2T ln . δ δ and therefore √ √ (A.2) 6L 2 ln 1/δ 12L 2 ln 1/δ √ F(w) − F(w∗ ) ≤ ∆2 √ , and ∥w − w∗ ∥2 ≤ ∆2 . T +1 λ T +1 Thus, when 1152L2 1 T ≥ ln , δ λ2 with a probability at least 1 − δ, we have F(w) − F(w∗ ) ≤ 8.4 λ 2 1 ∆ , and ∥w − w∗ ∥2 ≤ ∆2 . 4 2 Experiments In this section, we provide experimental evidence complementing our theoretical results. In particular we consider solving the regularized logistic regression problem formulated as: λ 1∑ fi (w) + ∥w∥2 w n 2 n min i=1 where −yi fi (w) = log (1 + exp(−yi ⟨w, xi ⟩)) , and ∇fi (w) = x. 1 + exp(yi ⟨w, xi ⟩) i 231 We compare the EMGD algorithm to the SGD method. 
SGD starts with the solution w_1 = 0 and, at each iteration, samples an index i_k uniformly at random over all n available functions and updates the solution by

    w_{t+1} = w_t − (1/(λt)) ( ∇f_{i_k}(w_t) + λw_t ) = (1 − 1/t) w_t − (1/(λt)) ∇f_{i_k}(w_t).

The variance of the SGD gradient at each iteration can be computed as

    E_{i_k} ‖ ∇f_{i_k}(w_t) + λw_t − ( (1/n) ∑_{i=1}^n ∇f_i(w_t) + λw_t ) ‖^2
      = E_{i_k} ‖ ∇f_{i_k}(w_t) − (1/n) ∑_{i=1}^n ∇f_i(w_t) ‖^2
      = (1/n) ∑_{i=1}^n ‖∇f_i(w_t)‖^2 − ‖ (1/n) ∑_{i=1}^n ∇f_i(w_t) ‖^2.

Noting that the gradient used by EMGD at each iteration is

    ( ∇f_{i_k}(w_t) + λw_t ) − ( ∇f_{i_k}(w̄) + λw̄ ) + ( (1/n) ∑_{i=1}^n ∇f_i(w̄) + λw̄ ),

the variance of the mixed gradient in EMGD is given by

    E_{i_k} ‖ ( ∇f_{i_k}(w_t) − ∇f_{i_k}(w̄) ) − (1/n) ∑_{i=1}^n ( ∇f_i(w_t) − ∇f_i(w̄) ) ‖^2
      = (1/n) ∑_{i=1}^n ‖∇f_i(w_t) − ∇f_i(w̄)‖^2 − ‖ (1/n) ∑_{i=1}^n ( ∇f_i(w_t) − ∇f_i(w̄) ) ‖^2.

We run both algorithms on two well-known data sets, Adult and RCV1, obtained from the University of California Irvine (UCI) Machine Learning Repository [11]. The Adult data set contains information on individuals, such as age, level of education, and current employment type; it contains over 30,000 records of census information taken in 1994 from many diverse demographics. RCV1 is an archive of over 800,000 manually categorized newswire stories recently made available by Reuters, Ltd. for research purposes. For EMGD, we set the number of stochastic gradients in each epoch to n, the number of training examples.

[Figure 8.2: Experimental results on the Adult data set (λ = 1e−3 and η = 0.01 in EMGD). Panel (a): the training loss minus the optimum versus the number of gradients; panel (b): the variance of the (stochastic or mixed) gradient versus the number of gradients, for EMGD and SGD.]

[Figure 8.3: Experimental results on the RCV1 data set (λ = 1e−5 and η = 1 in EMGD). Panel (a): the training error versus the number of gradients; panel (b): the variance of the (stochastic or mixed) gradient versus the number of gradients, for EMGD and SGD.]

[Figure 8.4: The testing accuracy of EMGD and SGD on the Adult data set (panel (a)) and the RCV1 data set (panel (b)).]

The results of both the SGD and EMGD algorithms on the Adult data set are provided in Figure 8.2. Figure 8.2a shows the difference between the current objective value and the optimum (which is obtained by running a batch algorithm for a long time) versus

    (the number of full gradients + the number of stochastic gradients) / n.

Figure 8.2b shows the variances of SGD and EMGD. It can be inferred from the results that the variance of SGD stays almost the same even when the algorithm approaches the optimal solution, whereas the EMGD method is able to reduce both the training error and the variance exponentially. We obtain similar results on the RCV1 data set, which are shown in Figure 8.3. We also examine the testing accuracy of SGD and EMGD, which is depicted in Figure 8.4. As can be seen, in terms of testing accuracy, EMGD is slightly worse than SGD at the beginning. This behavior is expected since EMGD needs one full gradient to initialize.
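For completeness, the following sketch contrasts the SGD update above with one epoch of EMGD (Algorithm 11), reusing the helpers grad_fi and full_grad from the previous sketch. It is an illustration under simplifying assumptions rather than the code behind the reported experiments: the domain is taken to be all of R^d, so the gradient mapping in step 9 reduces to a gradient step followed by clipping back into the ball {‖w − w̄‖ ≤ Δ}.

    import numpy as np
    # assumes grad_fi and full_grad from the previous sketch are in scope

    def sgd(X, y, lam, T, rng):
        # Plain SGD with step size 1/(lam * t), as described above.
        n, d = X.shape
        w = np.zeros(d)
        for t in range(1, T + 1):
            i = rng.integers(n)
            g = grad_fi(w, X[i], y[i]) + lam * w      # stochastic gradient of the regularized objective
            w = w - g / (lam * t)
        return w

    def emgd_epoch(w_bar, delta, eta, X, y, lam, T, rng):
        # One epoch of EMGD: mixed gradient steps inside the ball {||w - w_bar|| <= delta}.
        n, d = X.shape
        g_full = full_grad(w_bar, X, y, lam)          # one full gradient per epoch (step 5)
        w = w_bar.copy()
        iterates = [w.copy()]
        for _ in range(T):
            i = rng.integers(n)
            # mixed gradient: full gradient at w_bar plus a stochastic correction (step 8)
            g_mix = (g_full
                     + (grad_fi(w, X[i], y[i]) + lam * w)
                     - (grad_fi(w_bar, X[i], y[i]) + lam * w_bar))
            w = w - eta * g_mix
            # gradient mapping over the shrinking ball around w_bar (step 9, unconstrained case)
            diff = w - w_bar
            norm = np.linalg.norm(diff)
            if norm > delta:
                w = w_bar + diff * (delta / norm)
            iterates.append(w.copy())
        # average of T + 1 solutions and domain size reduced by sqrt(2) (step 11)
        return np.mean(iterates, axis=0), delta / np.sqrt(2.0)

Chaining emgd_epoch over m epochs, feeding the returned average back in as the next w̄ together with the shrunken radius, reproduces the outer loop of Algorithm 11.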
8.5 Discussion

Compared to the optimization algorithm that relies only on full gradients [121], the number of full gradients needed by EMGD is O(log(1/ϵ)) instead of O(√κ log(1/ϵ)). Compared to the optimization algorithms that rely only on stochastic gradients [81, 72, 125], EMGD is more efficient since it achieves a linear convergence rate. The proposed EMGD algorithm can also be applied to the special optimization problem considered in [128, 135], where F(w) = (1/n) ∑_{i=1}^n f_i(w). To make quantitative comparisons, let us assume the full gradient is n times more expensive to compute than the stochastic gradient. Table 8.5 lists the computational complexity of the algorithms that enjoy linear convergence. As can be seen, the computational complexity of EMGD is lower than that of Nesterov's algorithm [121] as long as the condition number κ ≤ n^{2/3}; the complexity of SAG [128] is lower than that of Nesterov's algorithm if κ ≤ n/8, and the complexity of SDCA [135] is lower if κ ≤ n^2.[1] The complexity of EMGD is of the same order as SAG and SDCA when κ ≤ n^{1/2}, but higher in other cases. Thus, in terms of computational cost, EMGD may not be the best one, but it has advantages in other aspects.

Table 8.5: The computational complexity for minimizing (1/n) ∑_{i=1}^n f_i(w)

    Nesterov's algorithm [121]      O(n√κ log(1/ϵ))
    EMGD                            O((n + κ^2) log(1/ϵ))
    SAG (n ≥ 8κ) [128]              O(n log(1/ϵ))
    SDCA [135]                      O((n + κ) log(1/ϵ))

1. Unlike SAG and SDCA, which only work for unconstrained optimization problems, the proposed algorithm works for both constrained and unconstrained optimization problems, provided the constrained problem in Step 9 can be solved efficiently.

2. Unlike SAG and SDCA, which require Ω(n) storage space, the proposed algorithm only requires Ω(d) storage, where d is the dimension of w.

3. The only step in Algorithm 11 that depends on n is step 5, which computes the gradient ∇F(w̄^k). By utilizing distributed computing, the running time of this step can be reduced to O(n/k), where k is the number of computers, and the convergence rate remains the same. For SAG and SDCA, it is unclear whether they can reduce the running time without affecting the convergence rate.

4. The linear convergence of SAG and SDCA only holds in expectation, whereas the linear convergence of EMGD holds with a high probability, which is much stronger.

[1] In learning problems, we usually face a regularized optimization problem min_{w∈W} (1/n) ∑_{i=1}^n ℓ(y_i; ⟨w, x_i⟩) + (τ/2)‖w‖^2, where ℓ(·; ·) is some fixed loss function. When the norm of the data is bounded, the smoothness parameter L can be treated as a constant. The strong convexity parameter λ is lower bounded by τ. As a result, as long as τ > Ω(n^{−2/3}), which is a reasonable scenario [149], we have κ < O(n^{2/3}), indicating that our proposed EMGD algorithm can be applied.

8.6 Summary

In this chapter, we considered how to reduce the number of full gradients needed for smooth and strongly convex optimization problems. Under the assumption that both the full gradient and the stochastic gradient are available, the EMGD algorithm, with the help of stochastic gradients, is able to reduce the number of full gradients needed from O(√κ log(1/ϵ)) to O(log(1/ϵ)). In the case that the objective function is of the form (8.1), i.e., a sum of n smooth functions, EMGD has a lower computational cost than the full gradient method [121] if the condition number κ ≤ n^{2/3}.
We validated our theoretical results on the convergence of the EMGD algorithm, and in particular its ability to reduce the variance during optimization, through classification experiments on two standard data sets. We note that although EMGD enjoys many nice properties, it is unclear whether it is the optimal algorithm when both kinds of gradients are available.

8.7 Bibliographic Notes

During the last three decades, there have been significant advances in convex optimization [119, 121, 34]. In this section, we provide a brief review of first order optimization methods.

We first discuss deterministic optimization, where the gradient of the objective function is available. For general convex and Lipschitz continuous optimization problems, the iteration complexity of gradient (subgradient) descent is O(1/ϵ^2), which is optimal up to constant factors [119]. When the objective function is convex and smooth, the optimal optimization scheme is the accelerated gradient descent developed by Nesterov, whose iteration complexity is O(√(L/ϵ)) [120, 123]. With slight modifications, the accelerated gradient descent algorithm can also be applied to optimize smooth and strongly convex objective functions, for which the iteration complexity is O(√κ log(1/ϵ)) and is in general not improvable [121, 124]. The objective of our work is to reduce the number of accesses to the full gradient by exploiting the availability of stochastic gradients.

In stochastic optimization, we have access to a stochastic gradient, which is an unbiased estimate of the full gradient [115]. Similar to the case of deterministic optimization, if the objective function is convex and Lipschitz continuous, stochastic gradient (subgradient) descent is the optimal algorithm and the iteration complexity is also O(1/ϵ^2) [119, 115]. When the objective function is strongly convex, the algorithms proposed in very recent works [81, 72, 125] achieve the optimal O(1/(λϵ)) iteration complexity [4]. Since the convergence rate of stochastic optimization is dominated by the randomness in the gradient [94, 62], smoothness usually does not lead to a faster convergence rate in stochastic optimization.

From the above discussion, we observe that the iteration complexity in stochastic optimization is polynomial in 1/ϵ, making it difficult to find high-precision solutions. However, when the objective function is strongly convex and can be written as a sum of a finite number of functions, i.e.,

    F(w) = (1/n) ∑_{i=1}^n f_i(w),

where each f_i(w) is smooth, the iteration complexity of some specific algorithms may exhibit a logarithmic dependence on 1/ϵ, i.e., a linear convergence rate. The first such algorithm is the stochastic average gradient (SAG) [128], whose iteration complexity is O(n log(1/ϵ)), provided n ≥ 8κ. The second one is the stochastic dual coordinate ascent (SDCA) [135], whose iteration complexity is O((n + κ) log(1/ϵ)).

Chapter 9

Efficient Optimization with Bounded Projections

In this chapter we aim at developing more efficient optimization methods by reducing the number of projection steps. An examination of the optimization methods for both online and stochastic convex optimization problems introduced in Chapter 2 reveals that most of them require projecting the updated solution at each iteration to ensure that the obtained solution stays within the feasible domain.
For complex domains (e.g., the positive semidefinite cone), the projection step can be computationally expensive, making first order optimization methods unattractive for large-scale optimization problems. The broad question addressed in this chapter of the thesis is the extent to which it is possible to reduce the number of projection steps in stochastic and online optimization algorithms.

In the stochastic setting, we address this limitation by developing novel stochastic optimization algorithms that do not need intermediate projections. Instead, only one projection at the last iteration is needed to obtain a feasible solution in the given domain. Our theoretical analysis shows that, with a high probability, the proposed algorithms achieve an O(1/√T) convergence rate for general convex functions and an O(ln T/T) rate for strongly convex functions, under mild conditions on the domain and the objective function. The key insight underlying the proposed method for strongly convex objectives is that by smoothing the objective function and leveraging the smoothed objective in the optimization, we are able to skip the intermediate projections. This is in contrast to other parts of the thesis, where we explicitly leveraged the smoothness assumption. To the best of our knowledge, these are the first projection-free stochastic optimization methods.

In the online setting, to tackle the computational challenge arising from the projection steps, we consider an alternative online learning problem. Instead of requiring that each solution obeys the constraints which define the convex domain, we only require the constraints to be satisfied in the long run. The online learning problem then becomes the task of finding a sequence of solutions under long term constraints. In other words, instead of solving the projection on each round, we allow the learner to make decisions at some rounds which do not belong to the constraint set, but the overall sequence of chosen decisions must obey the constraints at the end at a vanishing rate. We refer to this problem as online learning with soft constraints. By turning the problem into an online convex-concave optimization problem, we propose an efficient algorithm which achieves an O(√T) regret bound and an O(T^{3/4}) bound on the violation of the constraints. We then modify the algorithm to guarantee that the constraints are exactly satisfied in the long run; this gain is achieved at the price of an O(T^{3/4}) regret bound.

9.1 Setup and Motivation

In the stochastic setting, we consider the following convex optimization problem:

    min_{w∈W} f(w),   (9.1)

where W is a bounded convex domain. We assume that W can be characterized by an inequality constraint and, without loss of generality, is contained in the unit ball, i.e.,

    W = {w ∈ R^d : g(w) ≤ 0} ⊆ B = {w ∈ R^d : ‖w‖_2 ≤ 1},   (9.2)

where g(w) is a convex constraint function. We assume that W has a non-empty interior, i.e., there exists w such that g(w) < 0, and that the optimal solution w_* to (9.1) lies in the interior of the unit ball B, i.e., ‖w_*‖_2 < 1. Note that when a domain is characterized by multiple convex constraint functions, say g_i(w) ≤ 0, i = 1, . . . , m, we can summarize them into the single constraint g(w) ≤ 0 by defining g(w) = max_{1≤i≤m} g_i(w).

To solve the optimization problem in (9.1), we assume that the only information available to the algorithm is through a stochastic oracle that provides unbiased estimates of the gradient of f(w). More precisely, let
ξ_1, . . . , ξ_T be a sequence of independent and identically distributed (i.i.d.) random variables sampled from an unknown distribution. At each iteration t, given the solution w_t, the oracle returns g(w_t; ξ_t), an unbiased estimate of the true gradient ∇f(w_t), i.e., E_{ξ_t}[g(w_t, ξ_t)] = ∇f(w_t). The goal of the learner is to find an approximately optimal solution by making T calls to this oracle.

Recall from Chapter 2 that, to find a solution within the domain W which optimizes the given objective function f(w), SGD computes an unbiased estimate of the gradient of f(w) and updates the solution by moving it in the opposite direction of the estimated gradient. To ensure that the solution stays within the domain W, SGD has to project the updated solution back into W at every iteration. More precisely, the SGD method produces a sequence of solutions by the following updating:

    w_{t+1} = Π_W(w_t − η_t g(w_t, ξ_t)),   (9.3)

where η_t is the step size at iteration t, Π_W(·) is a projection operator that projects w into the domain W, and g(w, ξ_t) is an unbiased stochastic gradient of f(w), for which we further assume a bounded gradient variance as

    E_{ξ_t}[exp(‖g(w, ξ_t) − ∇f(w)‖_2^2 / σ^2)] ≤ exp(1).   (9.4)

For general convex optimization, stochastic gradient descent methods can obtain an O(1/√T) convergence rate in expectation or with a high probability, provided (9.4) holds [115]. Although a large number of iterations is usually needed to obtain a solution of desirable accuracy, the lightweight computation per iteration makes SGD attractive for many large-scale learning problems.

The SGD method is computationally efficient only when the projection Π_W(·) can be carried out efficiently. Although efficient algorithms have been developed for projecting solutions onto special domains (e.g., the simplex and the ℓ_1 ball [51, 100]), for complex domains, such as the positive semidefinite (PSD) cone in metric learning and bounded trace norm matrices in matrix completion (more examples of complex domains can be found in [73] and [77]), the projection step requires solving an expensive convex optimization problem, leading to a high computational cost per iteration and consequently making SGD unappealing for large-scale optimization problems over such domains. For instance, projecting a matrix onto the PSD cone requires computing the full eigen-decomposition of the matrix, whose complexity is cubic in the size of the matrix.

The central theme of this chapter, in the stochastic setting, is to develop an SGD based method that does not require a projection at each iteration. This problem was first addressed in a very recent work [73], where the authors extended the Frank-Wolfe algorithm [56] to online learning. However, one main shortcoming of the algorithm proposed in [73] is that it has a slower convergence rate (i.e., O(T^{−1/3})) than the standard SGD algorithm (i.e., O(T^{−1/2})). In this work, we demonstrate that a properly modified SGD algorithm can achieve the optimal convergence rate of O(T^{−1/2}) using only ONE projection for general stochastic convex optimization problems. We further develop an SGD based algorithm for strongly convex optimization that achieves a convergence rate of O(ln T/T), which is only a logarithmic factor worse than the optimal rate [72]. The key idea of both algorithms is to appropriately penalize the intermediate solutions when they are outside the domain.
With an appropriate design of the penalization mechanism, the average solution w̄_T obtained by SGD after T iterations will be very close to the domain W, even without intermediate projections. As a result, the final feasible solution w̃_T can be obtained by projecting w̄_T onto the domain W, the only projection that is needed for the entire algorithm. We note that our approach is very different from previous efforts in developing projection-free convex optimization algorithms (see [73, 78, 77] and references therein), where the key idea is to develop appropriate updating procedures to restore the feasibility of solutions at every iteration.

9.2 Stochastic Frank-Wolfe Algorithm

Before presenting the proposed algorithms, we investigate and analyze a greedy algorithm, a slight modification of the Frank-Wolfe (FW) method, when applied to the stochastic optimization problem. As discussed in Chapter 2, the FW or conditional gradient algorithm is a feasible direction method: at each iteration, it finds the best feasible direction (with respect to the linear approximation of the function) and updates the next solution as a convex combination of the current solution and the chosen feasible direction. This algorithm was revisited in [77] for deterministic smooth convex optimization over a convex domain, aiming to devise an efficient algorithm without projections. As discussed before, the algorithm in [77] generates a sequence of solutions via the following steps:

    u_t = arg max_{u∈W} ⟨u, −∇f(w_t)⟩,
    w_{t+1} = (1 − η_t) w_t + η_t u_t.   (9.5)

The bulk of the computation in the FW algorithm is step (9.5), and the algorithm is attractive whenever step (9.5) can be carried out at low computational cost, for otherwise it is not practically interesting. This is true for some special domains, such as the polyhedron, the simplex, and the ℓ_1 ball, where the linear optimization problem can be solved efficiently. However, this assumption does not hold for general complex domains (e.g., general PSD cones). This limitation makes the FW algorithm computationally unattractive for such domains, since it simply translates the complexity of the quadratic optimization required for projection into a linear optimization problem over the same domain. We note that the FW algorithm is also attractive because of its sparsity merits, which are not the focus of this chapter; here we only consider the computational efficiency of optimization methods.

A trivial modification to extend the FW algorithm to stochastic gradients is to replace the true gradient ∇f(w_t) with a stochastic gradient g(w_t, ξ_t) in (9.5). Next, we sketch an analysis of this stochastic greedy algorithm and discuss the problem caused by using an unbiased estimate of the gradient, instead of the true gradient, in updating the solutions. A key inequality in the convergence analysis of the greedy algorithm is

    f(w_{t+1}) ≤ f(w_t) − η_t max_{w∈W} ⟨w_t − w, ∇f(w_t)⟩ + η_t^2 L
              = f(w_t) − η_t g(w_t, ∇f(w_t)) + η_t^2 L,   (9.6)

where f(w) is assumed to be an L-smooth function and g(w, ∇f(w)) = max_{u∈W} ⟨w − u, ∇f(w)⟩ is the duality gap. Using g(w_t, ∇f(w_t)) ≥ f(w_t) − f(w_*), we can obtain an O(1/T) convergence bound for the greedy algorithm with η_t = 2/(t+1) by induction (e.g., see Theorem 2.21 in Chapter 2).
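To make the discussion concrete, the sketch below instantiates the deterministic update (9.5) for the case where W is an ℓ_1 ball, one of the special domains for which the linear step has a closed form; the function names and the choice of domain are ours, purely for illustration. Replacing grad_f(w) with a stochastic estimate yields the trivial stochastic modification discussed next.

    import numpy as np

    def fw_step_l1(grad, radius):
        # argmin_{||u||_1 <= radius} <u, grad>: put all mass on the coordinate of largest |grad|.
        u = np.zeros_like(grad)
        j = np.argmax(np.abs(grad))
        u[j] = -radius * np.sign(grad[j])
        return u

    def frank_wolfe(grad_f, w0, radius, T):
        # Deterministic Frank-Wolfe / conditional gradient with eta_t = 2/(t+1).
        # w0 is assumed to be feasible, i.e. ||w0||_1 <= radius.
        w = w0.copy()
        for t in range(1, T + 1):
            u = fw_step_l1(grad_f(w), radius)
            eta = 2.0 / (t + 1.0)
            w = (1.0 - eta) * w + eta * u   # convex combination keeps w feasible
        return w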
However, when using the stochastic gradient, ut = arg maxu∈W ⟨u, −g(wt , ξt )⟩, and the key inequality becomes as: f (wt+1 ) ≤ f (wt ) − ηt ⟨wt − ut , ∇f (wt )⟩ + ηt2 L ≤ f (wt ) − ηt g(wt , ∇f (wt )) + ηt2 L + ηt (g(wt , ∇f (wt )) − g(wt , g(wt , ξt ))) + ηt ⟨wt − ut , g(wt , ξt ) − ∇f (wt )⟩ ≜ ζt where the top line recovers the inequality in (9.6). However, the problem is that the quantity ζt in the second line of above inequality is not a martingale sequence, i.e., Eξ |ξ [ζ ] ̸= 0. t [t−1] t We can take a conservative analysis to bound ζt ≤ Cηt ∥∇f (wt )−g(wt , ξt )∥∗ ≜ Cηt δt , where C is a constant. Assuming ∥∇f (wt ) − g(wt , ξt )∥∗ ≤ σ, we could have, with a probably 1 − ϵ, ηt δt ≤ ηt (1 + ln(1/ϵ))σ by Markov inequality in Lemma A.18. As a result, we obtain the 245 following recursive inequality f (wt+1 ) − f (w∗ ) ≤ (1 − ηt )(f (wt ) − f (w∗ )) + ηt2 L + ηt C(1 + ln(1/ϵ))σ where the last term in the above inequality makes the convergence analysis more involved. Whether it is possible to design a sequence of step sizes ηt to have a vanishing bound for f (wt ) − f (w∗ ) is unclear for us and remains an open problem, which is beyond the scope of this chapter. 9.3 Stochastic Optimization with Single Projection We now turn to extending the SGD method to the setting where only one projection is allowed to perform for the entire sequence of updating. The main idea is to incorporate the constraint function g(w) into the objective function to penalize the intermediate solutions that are outside the domain. The result of the penalization is that, although the average solution obtained by SGD may not be feasible, it should be very close to the boundary of the domain. A projection is performed at the end of the iterations to restore the feasibility of the average solution. Before proceeding, we recall few definitions from Appendix ?? about convex analysis. Definition 9.1. A function f (w) is a Lipschitz continuous function with constant G w.r.t a norm ∥ · ∥, if |f (w1 ) − f (w2 )| ≤ G∥w1 − w2 ∥, ∀w1 , w2 ∈ B. (9.7) In particular, a convex function f (w) with a bounded (sub)gradient ∥∂f (w)∥∗ ≤ G is 246 Algorithm 12 SGD with ONE Projection by Primal Dual Updating (SGD-PD) 1: Input: a sequence of step sizes {ηt }, and a parameter γ > 0 2: Initialize:: w1 = 0 and λ1 = 0 3: for t = 1, 2, . . . , T do ′ 4: Compute wt+1 = wt − ηt (g(wt , ξt ) + λt ∇g(wt )) ′ / max (∥w′ ∥ , 1), 5: Update wt+1 = wt+1 t+1 2 6: Update λt+1 = [(1 − γηt )λt + ηt g(wt )]+ 7: end for ∑T 8: Output: wT = ΠW (wT ), where wT = t=1 wt /T . G-Lipschitz continuous, where ∥ · ∥∗ is the dual norm to ∥ · ∥. Definition 9.2. A convex function f (w) is β-strongly convex w.r.t a norm ∥·∥ if there exists a constant β > 0 (often called the modulus of strong convexity) such that, for any α ∈ [0, 1], it holds: 1 f (αw1 + (1 − α)w2 ) ≤ αf (w1 ) + (1 − α)f (w2 ) − α(1 − α)β∥w1 − w2 ∥2 , ∀w1 , w2 ∈ B. 2 When f (w) is differentiable, the strong convexity is equivalent to f (w1 ) ≥ f (w2 ) + ⟨∇f (w2 ), w1 − w2 ⟩ + β ∥w − w2 ∥2 , ∀w1 , w2 ∈ B. 2 1 In the sequel, we use the standard Euclidean norm to define Lipschitz and strongly convex functions. The key ingredient of proposed algorithms is to replace the projection step with the gradient computation of the constraint function defining the domain W, which is significantly cheaper than projection step. 
As an example, when a solution is restricted to a PSD cone, i.e., X ⪰ 0 where X is a symmetric matrix, the corresponding inequality constraint is g(X) = λmax (−X) ≤ 0, where λmax (X) computes the largest eigenvalue of X and is a 247 convex function. In this case, ∇g(X) only requires computing the minimum eigenvector of a matrix, which is cheaper than a full eigenspectrum computation required at each iteration of the standard SGD algorithm to restore feasibility. Below, we state a few assumptions about f (w) and g(w) often made in stochastic optimization as: Assumption 9.3. We assume that: A1 ∥∇f (w)∥2 ≤ G1 , ∥∇g(w)∥2 ≤ G2 , |g(w)| ≤ C2 , A2 Eξt [exp(∥g(w, ξt ) − ∇f (w)∥22 /σ 2 )] ≤ exp(1), ∀w ∈ B, (9.8) ∀w ∈ B. We also make the following mild assumption about the boundary of the convex domain W as: Assumption 9.4. We assume that: A3 there exists a constant ρ > 0 such that min ∥∇g(w)∥2 ≥ ρ. (9.9) g(w)=0 Remark 9.5. The purpose of introducing assumption A3 is to ensure that the optimal dual variable for the constrained optimization problem in (9.1) is well bounded from the above, a key factor for our analysis. To see this, we write the problem in (9.1) into a convex-concave optimization problem: min max f (w) + λg(w). w∈B λ≥0 Let (w∗ , λ∗ ) be the optimal solution to the above convex-concave optimization problem. Since we assume g(w) is strictly feasible, w∗ is also an optimal solution to (9.1) due to the strong duality theorem [33]. Using the first order optimality condition, we have ∇f (w∗ ) = 248 −λ∗ ∇g(w∗ ). Hence, λ∗ = 0 when g(w∗ ) < 0, and λ∗ = ∥∇f (w∗ )∥2 /∥∇g(w∗ )∥2 when g(w∗ ) = 0. Under assumption A3, we have λ∗ ∈ [0, G1 /ρ]. We note that, from a practical point of view, it is straightforward to verify that for many domains including PSD cone and Polytope, the gradient of the constraint function is lower bounded on the boundary and therefore assumption A3 does not limit the applicability of the proposed algorithms for stochastic optimization. For the example of g(X) = λmax (−X), the assumption A3 implies ming(X)=0 ∥∇g(X)∥F = ∥uu⊤ ∥F = 1, where u is an orthonomal vector representing the corresponding eigenvector of the matrix X whose minimum eigenvalue is zero. We propose two different ways of incorporating the constraint function into the objective function, which result in two algorithms, one for general convex and the other for strongly convex functions. 9.3.1 General Convex Functions To incorporate the constraint function g(w), we introduce a regularized Lagrangian function, γ L(w, λ) = f (w) + λg(w) − λ2 , 2 λ ≥ 0. The summation of the first two terms in L(w, λ) corresponds to the Lagrangian function in dual analysis and λ corresponds to a Lagrangian multiplier. A regularization term −(γ/2)λ2 is introduced in L(w, λ) to prevent λ from being too large. Instead of solving the constrained optimization problem in (9.1), we try to solve the following convex-concave optimization 249 problem min max L(w, λ). w∈B λ≥0 (9.10) The proposed algorithm for stochastically optimizing the problem in (9.10) is summarized in Algorithm 12. It differs from the existing stochastic gradient descent methods in that it updates both the primal variable w (steps 4 and 5) and the dual variable λ (step 6), which shares the same step sizes. We note that the parameter ρ is not employed in the implementation of Algorithm 12 and is only required for the theoretical analysis. It is noticeable that a similar primal-dual updating will be explored later in this chapter to avoid projection in online learning. 
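As a concrete rendering of these updates, the sketch below performs the primal-dual iterations of Algorithm 12 with the single projection deferred to the end. The callables stoch_grad, g, grad_g, and project_W are placeholders to be supplied by the user; for the PSD example above, g(X) = λ_max(−X) = −λ_min(X), its (sub)gradient is the negated outer product of an eigenvector associated with the smallest eigenvalue, and project_W is the expensive eigen-decomposition based projection, now invoked exactly once. This is an illustrative sketch rather than a reference implementation.

    import numpy as np

    def sgd_one_projection(stoch_grad, g, grad_g, project_W, w0, T, eta, gamma, rng):
        # Sketch of Algorithm 12 (SGD-PD): primal-dual updates inside the unit ball,
        # with a single projection onto W at the very end.
        w, lam = w0.copy(), 0.0
        w_sum = np.zeros_like(w0)
        for t in range(T):
            w_sum += w                                               # running sum of w_1, ..., w_T
            grad_w = stoch_grad(w, rng) + lam * grad_g(w)            # step 4: primal gradient of L(w, lam)
            lam = max((1.0 - gamma * eta) * lam + eta * g(w), 0.0)   # step 6: dual update with shrinkage
            w = w - eta * grad_w
            w = w / max(np.linalg.norm(w), 1.0)                      # step 5: stay inside the unit ball B
        w_bar = w_sum / T
        return project_W(w_bar)                                      # the only projection onto W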
We note that in online setting the algorithm and analysis only lead to a bound for the regret and the violation of the constraints in a long run, which does not necessarily guarantee the feasibility of final solution. Also our proof techniques differ from [115], where the convergence rate is obtained for the saddle point; however our goal is to attain bound on the convergence of the primal feasible solution. Remark 9.6. The convex-concave optimization problem in (9.10) is equivalent to the following minimization problem: min f (w) + w∈B [g(w)]2+ , 2γ (9.11) where [z]+ outputs z if z > 0 and zero otherwise. It thus may seem tempting to directly optimize the penalized function f (w) + [g(w)]2+ /(2γ) using the standard SGD method, which √ unfortunately does not yield a regret of O( T ). This is because, in order to obtain a regret √ √ of O( T ), we need to set γ = Ω( T ), which unfortunately will lead to a blowup of the 250 gradients and consequently a poor regret bound. Using a primal-dual updating schema allows √ us to adjust the penalization term more carefully to obtain an O(1/ T ) convergence rate. Theorem 9.7. For any general convex function f (w), if we set ηt = γ/(2G22 ), t = 1, · · · , T , √ 2 and γ = G2 / (G21 + C22 + (1 + ln(2/δ))σ 2 )T in Algorithm 12, under assumptions A1-A3, we have, with a probability at least 1 − δ, ( f (wT ) ≤ min f (w) + O w∈W 1 √ T ) , where O(·) suppresses polynomial factors that depend on ln(2/δ), G1 , G2 , C2 , ρ, and σ. 9.3.2 Strongly Convex Functions with Smoothing We first emphasize that it is difficult to extend Algorithm 12 to achieve an O(ln T /T ) convergence rate for strongly convex optimization. This is because although the function −L(w, λ) is strongly convex in λ, its modulus for strong convexity is γ, which is too small to obtain an O(ln T ) regret bound. To achieve a faster convergence rate for strongly convex optimization, we change assumptions A1 and A2 as follows. Assumption 9.8. A4 ∥g(w, ξt )∥2 ≤ G1 , ∥∇g(w)∥2 ≤ G2 , ∀w ∈ B, where we slightly abuse the same notation G1 . Note that A1 only requires that ∥∇f (w)∥2 is bounded and A2 assumes a mild condition 251 on the stochastic gradient. In contrast, for strongly convex optimization we need to assume a bound on the stochastic gradient ∥g(w, ξt )∥2 . Although assumption A4 is stronger than assumptions A1 and A2, however, it is always possible to bound the stochastic gradient for machine learning problems where f (w) usually consists of a summation of loss functions on training examples, and the stochastic gradient is computed by sampling over the training examples. Given the bound on ∥g(w, ξt )∥2 , we can easily have ∥∇f (w)∥2 = ∥Eg(w, ξt )∥2 ≤ E∥g(w, ξt )∥2 ≤ G1 , which is used to set an input parameter λ0 > G1 /ρ to the algorithm. According to the discussion in the last subsection, we know that the optimal dual variable λ∗ is upper bounded by G1 /ρ, and consequently is upper bounded by λ0 . Similar to the last approach, we write the optimization problem (9.1) into an equivalent convex-concave optimization problem: min f (w) = min max f (w) + λg(w) = min f (w) + λ0 [g(w)]+ . 
g(w)≤0 w∈B 0≤λ≤λ0 w∈B To avoid unnecessary complication due to the subgradient of [·]+ , following [123], we introduce a smoothing term H(λ/λ0 ), where H(p) = −p ln p − (1 − p) ln(1 − p) is the entropy function, into the Lagrangian function, leading to the optimization problem min F (w), where w∈B F (w) is defined as ( ( )) λ0 g(w) , F (w) = f (w) + max λg(w) + γH(λ/λ0 ) = f (w) + γ ln 1 + exp γ 0≤λ≤λ0 where γ > 0 is a parameter whose value will be determined later. Given the smoothed objective function F (w), we find the optimal solution by applying SGD to minimize F (w), 252 Algorithm 13 SGD with ONE Projection by a Smoothing Technique (SGD-ST) 1: Input: a sequence of step sizes {ηt }, λ0 , and γ 2: Initialize: w1 = 0. 3: for t = 1, . . . , T do 4: Compute ′ wt+1 = wt − ηt ( g(wt , ξt ) + exp (λ0 g(wt )/γ) λ ∇g(wt ) 1 + exp(λ0 g(wt )/γ) 0 ) ′ / max(∥w′ ∥ , 1) 5: Update wt+1 = wt+1 t+1 2 6: end for ∑T 7: Return: wT = ΠW (wT ), where wT = t=1 wt /T . where the gradient of F (w) is computed by ∇F (w) = ∇f (w) + exp (λ0 g(w)/γ) λ ∇g(w). 1 + exp (λ0 g(w)/γ) 0 (9.12) Algorithm 13 gives the detailed steps. Unlike Algorithm 12, only the primal variable w is updated in each iteration using the stochastic gradient computed in (9.12). The following theorem shows that Algorithm 13 achieves an O(ln T /T ) convergence rate if the cost functions are strongly convex. Theorem 9.9. For any β-strongly convex function f (w), if we set ηt = 1/(2βt), t = 1, . . . , T , γ = ln T /T , and λ0 > G1 /ρ in Algorithm 13, under assumptions A3 and A4, we have with a probability at least 1 − δ, ( f (wT ) ≤ min f (w) + O w∈W ln T T ) , where O(·) suppresses polynomial factors that depend on ln(1/δ), 1/β, G1 , G2 , ρ, and λ0 . It is well known that the optimal convergence rate of SGD for strongly convex optimization is O(1/T ) [72] which has been proven to be tight in stochastic optimization setting [4]. 253 According to Theorem 9.9, Algorithm 13 achieves an almost optimal convergence rate except for the factor of ln T . It is worth mentioning that although it is not explicitly given in Theorem 9.9, the detailed expression for the convergence rate of Algorithm 13 exhibits a tradeoff in setting λ0 (more can be found in the proof of Theorem 9.9). Finally, under √ assumptions A1-A3, Algorithm 13 can achieve an O(1/ T ) convergence rate for general convex functions, similar to Algorithm 12. 9.4 Analysis of Convergence Rate We here present the proofs of main theorems. The omitted proofs of some results are deferred to Section 9.6. 9.4.1 Convergence Rate for General Convex Functions To pave the path for the proof of of Theorem 9.7, we present a series of lemmas. The lemma below states two key inequalities for the proof, which follows the standard analysis of gradient descent. Lemma 9.10. Under the bounded assumptions in (9.8) and (9.8), for any w ∈ B and λ > 0, we have ⟨wt − w, ∇w L(wt , λt )⟩ ≤ ) 1 ( ∥w − wt ∥22 − ∥w − wt+1 ∥22 + 2ηt G21 + ηt G22 λ2t 2ηt + 2ηt ∥g(wt , ξt ) − ∇f (wt )∥22 + ⟨w − wt , g(wt , ξt ) − ∇f (wt )⟩, ≡∆t ) 1 ( 2 2 (λ − λt )∇λ L(wt , λt ) ≤ |λ − λt | − |λ − λt+1 | + 2ηt C22 . 2ηt 254 ≡ζt (w) An immediate result of Lemma 9.10 is the following which states a regret-type bound. Lemma 9.11. For any general convex function f (w), if we set ηt = γ/(2G22 ), t = 1, · · · , T , we have T ∑ (f (wt ) − f (w∗ )) + t=1 T T ∑ G22 (G21 + C22 ) γ ∑ ≤ + γT + 2 ∆t + ζt (w∗ ), γ G22 G2 t=1 t=1 ∑ [ Tt=1 g(wt )]2+ 2(γT + 2G22 /γ) (9.13) where w∗ = arg minw∈W f (w). Proof. 
(Proof of Theorem 9.7) First, by martingale inequality (e.g., Lemma 4 in [94]), with √ √ ∑ a probability 1 − δ/2, we have Tt=1 ζt (w∗ ) ≤ 2σ 3 ln(2/δ) T . By Markov’s inequality (Lemma A.18 in Appendix A.2), with a probability 1 − δ/2, we have ∑T t=1 ∆t ≤ (1 + ln(2/δ))σ 2 T . Substituting these inequalities into Lemma 9.11, plugging the stated value of γ, and using O(·) notation for ease of exposition, we have with a probability 1 − δ T ∑ T √ ]2 1 [∑ (f (wt ) − f (w∗ )) + √ g(wt ) + ≤ O( T ), C T t=1 t=1 √ √ 2 2 2 where C = 2G2 (1/ G1 + C2 + (1 + ln(2/δ))σ + 2 G21 + C22 + (1 + ln(2/δ))σ 2 ) and O(·) suppresses polynomial factors that depend on ln(2/δ), G1 , G2 , C2 , σ. Recalling the definition of wT = ∑T t=1 wt /T and using the convexity of f (w) and g(w), we have √ ( ) 1 T 2 f (wT ) − f (w∗ ) + [g(wT )]+ ≤ O √ . C T 255 (9.14) Assume g(wT ) > 0, otherwise wT = wT and we easily have f (wT ) ≤ minw∈W f (w) + √ O(1/ T ). Since wT is the projection of wT into W, i.e., wT = arg ming(w)≤0 ∥w − wT ∥22 , then by first order optimality condition, there exists a positive constant s > 0 such that g(wT ) = 0, and wT − wT = s∇g(wT ) ˜ T ). Hence, which indicates that wT − wT is in the same direction to ∇g(w g(wT ) = g(wT ) − g(wT ) ≥ (wT − wT )⊤ ∇g(wT ) = ∥wT − wT ∥2 ∥∇g(wT )∥2 (9.15) ≥ ρ∥wT − wT ∥2 , where the last inequality follows the definition of ming(w)=0 ∥∇g(w)∥2 ≥ ρ. Additionally, we have f (w∗ ) − f (wT ) ≤ f (w∗ ) − f (wT ) + f (wT ) − f (wT ) ≤ G1 ∥wT − wT ∥2 , (9.16) due to f (w∗ ) ≤ f (wT ) and Lipschitz continuity of f (w). Combining inequalities (9.14), (9.15), and (9.16) yields √ ρ2 √ T ∥wT − wT ∥22 ≤ O(1/ T ) + G1 ∥wT − wT ∥2 . C G C By simple algebra, we have ∥wT − wT ∥2 ≤ 21√ + O ρ T 256 (√ ) C ρ2 T Therefore f (wT ) ≤ f (wT ) − f (wT ) + f (wT ) ≤ G1 ∥wT − wT ∥2 + f (w∗ ) + O ( ) 1 ≤ f (w∗ ) + O √ , T ( 1 √ T ) (9.17) where we use the inequality in (9.14) to bound f (wT ) by f (w∗ ) and absorb the dependence on ρ, G1 , C into the O(·) notation. Remark 9.12. From the proof of Theorem 9.7, we can see that the key inequalities are (9.14), (9.15), and (9.16). In particular, the regret-type bound in (9.14) depends on the algorithm. If we only update the primal variable using the penalized objective in (9.11), whose gradient depends on 1/γ, it will cause a blowup in the regret bound with (1/γ + γT + T /γ), which leads to a non-convergent bound. 9.4.2 Convergence Rate for Strongly Convex Functions Our proof of Theorem 9.9 for the convergence rate of Algorithm 13 when applied to strongly convex functions starts with the following lemma by analogy of Lemma 9.11. Lemma 9.13. For any β-strongly convex function f (w), if we set ηt = 1/(2βt), we have T T T ∑ (G21 + λ20 G22 )(1 + ln T ) ∑ β∑ + ζt (w∗ ) − ∥w∗ − wt ∥22 (F (w) − F (w∗ )) ≤ 2β 4 t=1 t=1 t=1 where w∗ = arg minw∈W f (w). In order to prove Theorem 9.9, we need the following result for an improved martingale inequality. 257 Lemma 9.14. For any fixed w ∈ B, define DT = ∑T ∑T 2 t=1 ∥wt − w∥2 , ΛT = t=1 ζt (w), and m = ⌈log2 T ⌉. We have √ ( ) ( ) 4 m m Pr DT ≤ + Pr ΛT ≤ 4G1 DT ln + 4G1 ln ≥ 1 − δ. T δ δ Proof of Theorem 9.9. We substitute the bound in Lemma 9.14 into the inequality in Lemma 9.13 with w = w∗ . We consider two cases. In the first case, we assume DT ≤ 4/T . As a result, we have T ∑ ζt (w∗ ) = t=1 T ∑ (∇f (wt ) − g(wt , ξt ))⊤ (w∗ − wt ) ≤ 2G1 √ T DT ≤ 4G1 , t=1 which together with the inequality in Lemma 9.13 leads to the bound T ∑ (F (wt ) − F (w∗ )) ≤ 4G1 + t=1 (G21 + λ20 G22 )(1 + ln T ) . 
2β In the second case, we assume T ∑ √ ζt (w∗ ) ≤ 4G1 t=1 m m β DT ln + 4G1 ln ≤ DT + δ δ 4 ( ) 16G21 m + 4G1 ln , β δ √ where the last step uses the fact 2 ab ≤ a2 + b2 . We thus have T ∑ t=1 ( (F (wt ) − F (w∗ )) ≤ 16G21 + 4G1 β ) ln m (G21 + λ20 G22 )(1 + ln T ) + . δ 2β Combing the results of the two cases, we have, with a probability 1 − δ, 258 T ∑ t=1 ( (F (wt ) − F (w∗ )) ≤ 16G21 + 4G1 β ) ln (G2 + λ20 G22 )(1 + ln T ) m . + 4G1 + 1 δ 2β (9.18) O(ln T ) By convexity of F (w), we have F (wT ) ≤ F (w∗ ) + O (ln T /T ). Noting that w∗ ∈ W, g(w∗ ) ≤ 0, we have F (w∗ ) ≤ f (w∗ ) + γ ln 2. On the other hand, ( ( )) λ0 g(wT ) F (wT ) = f (wT ) + γ ln 1 + exp ≥ f (wT ) + max (0, λ0 g(wT )) . γ Therefore, with the value of γ = ln T /T , we have ( f (wT ) ≤ f (w∗ ) + O ln T T ) , f (wT ) + λ0 g(wT ) ≤ f (w∗ ) + O ( ln T T ) (9.19) . Applying the inequalities (9.15) and (9.16) to (9.19), and noting that γ = ln T /T , we have ( λ0 ρ∥wT − wT ∥2 ≤ G1 ∥wT − wT ∥2 + O ln T T ) . For λ0 > G1 /ρ, we have ∥wT − wT ∥2 ≤ (1/(λ0 ρ − G1 ))O(ln T /T ). Therefore ( f (wT ) ≤ f (wT )−f (wT )+f (wT ) ≤ G1 ∥wT −wT ∥2 +f (w∗ )+O where in the second inequality we use inequality (9.19). 259 ln T T ) ( ≤ f (w∗ )+O ln T T ) , 9.5 Online Optimization with Soft Constraints We now turn to making online learning algorithms more efficient by excluding the projection steps. To this end, we consider an alternative online learning problem in which, instead of requiring that each solution obeys the constraints, we only require the constraints, which define the convex domain W, to be satisfied in a long run. Then, the online learning problem becomes a task to find a sequence of solutions under the long term constraints. We present and analyze an online gradient descent based algorithm for online convex optimization with soft constraints. To facilitate our analysis, similar to the stochastic setting described in Section 9.1, we assume that the domain W can be written as an intersection of a finite number of convex constraints, that is, W = {w ∈ Rd : gi (w) ≤ 0, i ∈ [m]}, where gi (w), i ∈ [m], are Lipschitz continuous functions. We assume that W is a bounded domain, that is, there exist constants R > 0 and r < 1 such that W ⊆ RB and rB ⊆ W where B denotes the unit ℓ2 ball centered at the origin. We focus on the problem of online convex optimization as introduced in Chapter 2, in which the goal is to achieve a low regret with respect to a fixed decision on a sequence of adversarially chosen cost functions. The difference between the setting considered here and the general online convex optimization is that, in our setting, instead of requiring wt ∈ W, or equivalently gi (wt ) ≤ 0, i ∈ [m], for all t ∈ [T ], we only require the constraints to be satisfied in the long run, namely ∑T t=1 gi (wt ) ≤ 0, i ∈ [m]. Then, the problem becomes to find a sequence of solutions wt , t ∈ [T ] that minimizes the regret, under the long term constraints ∑T t=1 gi (wt ) ≤ 0, i ∈ [m]. Formally, we would like to solve the following optimization 260 problem online, min T ∑ w1 ,...,wT ∈B t=1 ft (wt ) − min w∈W T ∑ ft (w) s.t. t=1 T ∑ gi (wt ) ≤ 0 , i ∈ [m]. (9.20) t=1 For simplicity, we will focus on a finite-horizon setting where the number of rounds T is known in advance. This condition can be relaxed under certain conditions, using standard techniques (see, e.g., [38]). Note that in (9.20), (i) the solutions come from the ball B ⊇ W instead of W and (ii) the constraint functions are fixed and are given in advance. 
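As a small illustration of the performance measures in (9.20), the snippet below computes, for a given sequence of plays, the regret with respect to a fixed comparator and the cumulative violation of each constraint. The function and argument names are ours; in practice the comparator w_star would be the offline minimizer of the cumulative loss over W.

    def evaluate_run(losses, plays, g_list, w_star):
        # losses: list of per-round loss functions f_t; plays: the algorithm's decisions w_t;
        # g_list: the constraint functions g_i; w_star: a fixed comparator in W.
        regret = sum(f(w) for f, w in zip(losses, plays)) - sum(f(w_star) for f in losses)
        violations = [sum(g(w) for w in plays) for g in g_list]   # long term violation of each g_i
        return regret, violations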
9.5.1 An Impossibility Theorem Before we state our formulation and algorithms, let us review a few alternative techniques that do not need explicit projection. A straightforward approach is to introduce an appropriate self-concordant barrier function for the given convex set W and add it to the objective function such that the barrier diverges at the boundary of the set. Then we can interpret the resulting optimization problem, on the modified objective functions, as an unconstrained minimization problem that can be solved without projection steps. Following the analysis in [3], with an appropriately designed procedure for updating solutions, we could guarantee √ a regret bound of O( T ) without the violation of constraints. A similar idea is used in [2] for online bandit learning and in [114] for a random walk approach for regret minimization which, in fact, translates the issue of projection into the difficulty of sampling. Even for linear Lipschitz cost functions, the random walk approach requires sampling from a Gaussian distribution with covariance given by the Hessian of the self-concordant barrier of the convex set W that has the same time complexity as inverting a matrix. The main limitation with these approaches is that they require computing the Hessian matrix of the objective 261 function in order to guarantee that the updated solution stays within the given domain W. This limitation makes it computationally unattractive when dealing with high dimensional data. In addition, except for well known cases, it is often unclear how to efficiently construct a self-concordant barrier function for a general convex domain. An alternative approach for online convex optimization with long term constraints is to introduce a penalty term in the loss function that penalizes the violation of constraints. More specifically, we can define a new loss function fˆt (w) as fˆt (w) = ft (w) + δ m ∑ [gi (w)]+ , (9.21) i=1 where [z]+ = max(0, 1−z) and δ > 0 is a fixed positive constant used to penalize the violation of constraints. We then run the standard OGD algorithm from Chapter 2 to minimize the modified loss function fˆt (w). The following theorem shows that this simple strategy fails to achieve sub-linear bound for both regret and the long term violation of constraints at the same time. Theorem 9.15. Given δ > 0, there always exists a sequence of loss functions f1 , f2 , · · · , fT ∑T ∑T and a constraint function g(w) such that either f (w ) − min t t g(w)≤0 t=1 t=1 ft (w) = O(T ) or ∑T t=1 [g(wt )]+ = O(T ) holds, where w1 , w2 , · · · , wT is the sequence of solutions generated by the OGD algorithm that minimizes the modified loss functions given in (9.21). Proof. We first show that when δ < 1, there exists a loss function and a constraint function such that the violation of constraint is linear in T . To see this, we set ft (w) = w⊤ w, t ∈ [T ] and g(w) = 1 − w⊤ w. Assume we start with an infeasible solution, that is, g(w1 ) > 0 or w1⊤ w < 1. Given the solution wt obtained at tth trial, using the standard gradient descent approach, we have wt+1 = wt − η(1 − δ)w. Hence, if wt⊤ w < 1, since we have 262 ⊤ w < w⊤ w < 1, if we start with an infeasible solution, all the solutions obtained over wt+1 t the trails will violate the constraint g(w) ≤ 0, leading to a linear number of violation of constraints. Based on this analysis, we assume δ > 1 in the analysis below. 
Given a strongly convex loss function f (w) with modulus γ, we consider a constrained optimization problem given by min f (w), g(w)≤0 which is equivalent to the following unconstrained optimization problem min f (w) + λ[g(w)]+ , w where λ ≥ 0 is the Lagrangian multiplier. Since we can always scale f (w) to make λ ≤ 1/2, it is safe to assume λ ≤ 1/2 < δ. Let w∗ and wa be the optimal solutions to the constrained optimization problems arg ming(w)≤0 f (w) and arg min f (w) + δ[g(w)]+ , respectively. We w choose f (w) such that ∥∇f (w∗ )∥ > 0, which leads to wa ̸= w∗ . This holds because according to the first order optimality condition, we have ∇f (w∗ ) = −λ∇g(w∗ ), ∇f (wa ) = −δ∇g(w∗ ), and therefore ∇f (w∗ ) ̸= ∇f (wa ) when λ < δ. Define ∆ = f (wa ) − f (w∗ ). Since ∆ ≥ γ∥wa − w∗ ∥2 /2 due to the strong convexity of f (w), we have ∆ > 0. Let {wt }Tt=1 be the sequence of solutions generated by the OGD algorithm that minimizes 263 the modified loss function f (w) + δ[g(w)]+ . We have T ∑ f (wt ) + δ[g(wt )]+ ≥ T min f (w) + δ[g(w)]+ w t=1 = T (f (wa ) + δ[g(wa )]+ ) ≥ T (f (wa ) + λ[g(wa )]+ ) = T (f (w∗ ) + λ[g(w∗ )]+ ) + T (f (wa ) + λ[g(wa )]+ − f (w∗ ) − λ[g(w∗ )]) ≥ T min f (w) + T ∆. g(w)≤0 As a result, we have T ∑ f (wt ) + δ[g(wt )]+ − min f (w) = O(T ), g(w)≤0 t=1 implying that either the regret ∑T t=1 f (wt ) − T f (w∗ ) or the violation of the constraints ∑T t=1 [g(w)]+ is linear in T . To better understand the performance of penalty based approach, here we analyze the performance of the OGD in solving the online optimization problem in (9.20). The algorithm is analyzed using the following lemma from [158] (see also Theorem 2.8 in Chapter 2). Lemma 9.16. Let w1 , w2 , . . . , wT ∈ W be the sequence of solutions obtained by applying OGD on the sequence of bounded convex functions f1 , f2 , . . . , fT . Then, for any solution w∗ ∈ W we have T ∑ t=1 ft (wt ) − T ∑ t=1 R2 η ∑ + ft (w∗ ) ≤ ∥∇ft (wt )∥2 . 2η 2 T t=1 We apply OGD to functions fˆt (w), t ∈ [T ] defined in (9.21), that is, instead of updating the solution based on the gradient of ft (w), we update the solution by the gradient of fˆt (w). 264 Using Lemma 9.16, by expanding the functions fˆt (w) based on (9.21) and considering the ∑ 2 fact that m i=1 [gi (w∗ )]+ = 0, we get T ∑ ft (wt ) − t=1 T ∑ t=1 δ ∑∑ R2 η ∑ [gi (w)]2+ ≤ + ∥∇fˆt (wt )∥2 . ft (w∗ ) + 2 2η 2 T m T t=1 i=1 (9.22) t=1 From the definition of fˆt (w), the norm of the gradient ∇fˆt (wt ) is bounded as follows ∥∇fˆt (w)∥2 = ∥∇ft (w) + δ m ∑ [gi (w)]+ ∇gi (w)∥2 ≤ 2G2 (1 + mδ 2 D2 ), (9.23) i=1 where the inequality holds because (a1 + a2 )2 ≤ 2(a21 + a22 ). By substituting (9.23) into the (9.22) we have: T ∑ ft (wt ) − t=1 T ∑ δ ∑∑ R2 [gi (wt )]2+ ≤ + ηG2 (1 + mδ 2 D2 )T. 2 2η T ft (w∗ ) + t=1 m (9.24) t=1 i=1 Since [·]2+ is a convex function, from Jensen’s inequality and following the fact that ∑T t=1 ft (wt )− ft (w∗ ) ≥ −F T , we have:  2 T m ∑ m T ∑ δ ∑∑ δ R2 2   gi (wt ) ≤ [gi (wt )]+ ≤ + ηG2 (1 + mδ 2 D2 )T + F T. 2T 2 2η i=1 t=1 i=1 t=1 + By minimizing the right hand side of (9.24) with respect to η, we get the regret bound as T ∑ t=1 ft (wt ) − T ∑ ft (w∗ ) ≤ RG √ √ 2(1 + mδ 2 D2 )T = O(δ T ) t=1 265 (9.25) and the bound for the violation of constraints as T ∑ t=1 √( ) 2T R2 2 2 2 gi (wt ) ≤ + ηG (1 + mδ D )T + F T = O(T 1/4 δ 1/2 + T δ −1/2 ). 
(9.26) 2η δ By examining the bounds obtained in (9.25) and (9.26), it turns out that in order to recover √ O( T ) regret bound, we need to set δ to be a constant, leading to O(T ) bound for the violation of constraints in the long run, which is not satisfactory at all. The analysis shows √ that in order to obtain O( T ) regret bound, linear bound on the long term violation of the constraints is unavoidable. The main reason for the failure of using modified loss function in (9.21) is that the weight constant δ is fixed and independent from the sequence of solutions obtained so far. In the next subsection, we present an online convex-concave formulation for online convex optimization with long term constraints, which explicitly addresses the limitation of (9.21) by automatically adjusting the weight constant based on the violation of the solutions obtained so far. As mentioned before, our general strategy is to turn online convex optimization with long term constraints into a convex-concave optimization problem. Instead of generating a sequence of solutions that satisfies the long term constraints, we first consider an online optimization strategy that allows the violation of constraints on some rounds in a controlled way. We then modify the online optimization strategy to obtain a sequence of solutions that obeys the long term constraints. Although the online convex optimization with long term constraints is clearly easier than the standard online convex optimization problem, it is straightforward to see that optimal regret bound for online optimization with long term √ constraints should be on the order of O( T ), no better than the standard online convex optimization problem. 266 9.5.2 √ An Efficient Algorithm with O( T ) Regret Bound and O(T 3/4 ) Bound on the Violation of Constraints The intuition behind our approach stems from the observation that the constrained optimization problem minw∈W ∑T t=1 ft (w) is equivalent to the following convex-concave optimization problem min max w∈B λ∈Rm + T ∑ ft (w) + t=1 m ∑ λi gi (w), (9.27) i=1 where λ = (λ1 , . . . , λm )⊤ is the vector of Lagrangian multipliers associated with the constraints gi (·), i = 1, . . . , m and belongs to the nonnegative orthant Rm + . To solve the online convex-concave optimization problem, we extend the gradient based approach for variational inequality [116] to (9.27). To this end, we consider the following regularized convex-concave function as } m { ∑ δη 2 Lt (w, λ) = ft (w) + λi gi (w) − λi , 2 (9.28) i=1 where δ > 0 is a constant whose value will be decided by the analysis. Note that in (9.28), we introduce a regularizer δηλ2i /2 to prevent λi from being too large. This is because, when λi ∑ is large, we may encounter a large gradient for w because of ∇w Lt (w, λ) ∝ m i=1 λi ∇gi (w), leading to unstable solutions and a poor regret bound. Although we can achieve the same goal by restricting λi to a bounded domain, using the quadratic regularizer makes it convenient for our analysis. 267 Algorithm 14 shows the detailed steps of the proposed algorithm. Unlike standard online convex optimization algorithms that only update w, Algorithm 14 updates both w and λ. In addition, unlike the modified loss function in (9.21) where the weights for constraints m {gi (w) ≤ 0}m i=1 are fixed, Algorithm 14 automatically adjusts the weights {λi }i=1 based on {gi (w)}m i=1 , the violation of constraints, as the game proceeds. 
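A minimal sketch of the primal-dual update induced by (9.28) is given below; Algorithm 14, presented next, states the precise steps. The function names are ours, the ball B is taken to be the unit ball purely for illustration, and the per-round quantities (the loss gradient and the constraint values and gradients at w_t) are assumed to be supplied by the caller.

    import numpy as np

    def soft_constraint_step(w, lam, grad_f_t, g_vals, g_grads, eta, delta):
        # One round of gradient descent/ascent on L_t in (9.28):
        # descent on w over the ball B (radius 1 here), ascent on the multipliers lambda >= 0.
        grad_w = grad_f_t + sum(l * gg for l, gg in zip(lam, g_grads))   # grad_w L_t(w, lam)
        grad_lam = np.asarray(g_vals) - eta * delta * lam                # grad_lam L_t(w, lam)
        w_new = w - eta * grad_w
        norm = np.linalg.norm(w_new)
        if norm > 1.0:
            w_new = w_new / norm                                         # projection onto the ball B
        lam_new = np.maximum(lam + eta * grad_lam, 0.0)                  # projection onto the nonnegative orthant
        return w_new, lam_new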
It is this property that allows Algorithm 14 to achieve sub-linear bound for both regret and the violation of constraints. To analyze Algorithm 14, we first state the following lemma, the key to the main theorem on the regret bound and the violation of constraints. Lemma 9.17. Let Lt (·, ·) be the function defined in (9.28) which is convex in its first argument and concave in its second argument. Then for any (w, λ) ∈ B × Rm + we have Lt (wt , λ) − Lt (w, λt ) ≤ 1 (∥w − wt ∥2 + ∥λ − λt ∥2 − ∥w − wt+1 ∥2 − ∥λ − λt+1 ∥2 ) 2η η + (∥∇w Lt (wt , λt )∥2 + ∥∇λ Lt (wt , λt )∥2 ). 2 Proof. Following the analysis of [158], convexity of Lt (·, λ) implies that Lt (wt , λt ) − Lt (w, λt ) ≤ ⟨wt − w, ∇w Lt (wt , λt )⟩ (9.29) and by concavity of Lt (w, ·) we have Lt (wt , λ) − Lt (wt , λt ) ≤ ⟨λ − λt , ∇λ Lt (wt , λt )⟩. (9.30) Combining the inequalities (9.29) and (9.30) results in Lt (wt , λ) − Lt (w, λt ) ≤ ⟨wt − w, ∇w Lt (wt , λt )⟩ − ⟨λ − λt , ∇λ Lt (wt , λt )⟩. 268 (9.31) Algorithm 14 Online Gradient Descent with Soft Constraints 1: Input: • constraints gi (w) ≤ 0, i ∈ [m] • step size η • constant δ > 0 2: Initialize: w1 = 0 and λ1 = 0 3: for t = 1, 2, . . . , T do 4: Submit solution wt 5: Receive the convex function ft (·) and suffer loss ft (wt ) 6: Compute the gradients as: ∇w Lt (wt , λt ) = ∇ft (wt ) + m ∑ λit ∇gi (wt ) i=1 ∇λi Lt (wt , λt ) = gi (wt ) − ηδλit 7: Update wt and λt by: wt+1 = ΠB (wt − η∇w Lt (wt , λt )) λt+1 = Π[0,+∞)m (λt + η∇λ Lt (wt , λt )) 8: end for Using the update rule for wt+1 in terms of wt and expanding, we get ∥w − wt+1 ∥2 ≤ ∥w − wt ∥2 − 2η⟨wt − w, ∇w Lt (wt , λt )⟩ + η 2 ∥∇w Lt (wt , λt )∥2 , (9.32) where the first inequality follows from the nonexpansive property of the projection operation (see Lemma A.5). Expanding the inequality for ∥λ − λt+1 ∥2 in terms of λt and plugging back into the (9.31) with (9.32) establishes the desired inequality. Proposition 9.18. Let wt and λt , t ∈ [T ] be the sequence of solutions obtained by Algorithm 14. Then for any w ∈ B and λ ∈ Rm + , we have 269 T ∑ Lt (wt , λ) − Lt (w, λt ) t=1 T ) η( )∑ R2 + ∥λ∥2 ηT ( 2 2 2 2 2 ≤ + (m + 1)G + 2mD + (m + 1)G + 2mδ η ∥λt ∥2 . 2η 2 2 t=1 Proof. We first bound the gradient terms in the right hand side of Lemma 9.17. Using the inequality (a1 + a2 + . . . , an )2 ≤ n(a21 + a22 + . . . + a2n ), we have ∥∇w Lt (wt , λt )∥2 ≤ ( ) (m + 1)G2 1 + ∥λt ∥2 and ∥∇λ Lt (wt , λt )∥2 ≤ 2m(D2 + δ 2 η 2 ∥λt ∥2 ). In Lemma 9.17, by adding the inequalities of all iterations, and using the fact ∥w∥ ≤ R we complete the proof. The following theorem bounds the regret and the violation of the constraints in the long run for Algorithm 14. √ √ Theorem 9.19. Define a = R (m + 1)G2 + 2mD2 . Set η = R2 /[a T ]. Assume T is √ large enough such that 2 2η(m + 1) ≤ 1. Choose δ such that δ ≥ (m + 1)G2 + 2mδ 2 η 2 . Let w1 , w2 , · · · , wT be the sequence of solutions obtained by Algorithm 14. Then for the optimal ∑ solution w∗ = minw∈W Tt=1 ft (w) we have T ∑ t=1 T ∑ √ ft (wt ) − ft (w∗ ) ≤ a T = O(T 1/2 ), and √ gi (wt ) ≤ ( √ )√ 2 FT + a T T t=1 ( δR2 ma + 2 a R ) = O(T 3/4 ). Proof. We begin by expanding (9.33) using (9.28) and rearranging the terms to get 270 T ∑  [ft (wt ) − ft (w)] + t=1 m  ∑ T ∑ i=1 t=1 λ  i gi (wt ) − T ∑ t=1   δηT i ∥λ∥2 λt gi (w) −  2 T ) δη ∑ R2 + ∥λ∥2 ηT ( ≤− ∥λt ∥2 + + (m + 1)G2 + 2mD2 2 2η 2 t=1 T )∑ η( 2 2 2 + (m + 1)G + 2mδ η ∥λt ∥2 . 
2 t=1 Since δ ≥ (m + 1)G2 + 2mδ 2 η 2 , we can drop the ∥λt ∥2 terms from both sides of the above inequality and obtain T ∑ [ft (wt ) − ft (w)] + t=1  m  ∑ i=1 ≤  m ∑ T ∑ λi T ∑ ( gi (wt ) − t=1 λit gi (w) + i=1 t=1 δηT m + 2 2η ) λ2i    ) R2 ηT ( + (m + 1)G2 + 2mD2 ) . 2η 2 The left hand side of above inequality consists of two terms. The first term basically measures the difference between the cumulative loss of the Algorithm 14 and the optimal solution and the second term includes the constraint functions with corresponding Lagrangian multipliers which will be used to bound the long term violation of the constraints. By taking maximization for λ over the range (0, +∞), we get   T T  ∑ ∑ + i − λt gi (w) [ft (wt ) − ft (w)] +  2(δηT + m/η)   t=1 t=1 i=1  ) R2 ηT ( 2 2 + (m + 1)G + 2mD ) . ≤ 2η 2  [∑ ]2 T m  g (w )  t ∑ i t=1 271 Since w∗ ∈ W, we have gi (w∗ ) ≤ 0, i ∈ [m], and the resulting inequality becomes T ∑ ft (wt ) − ft (w∗ ) + t=1 m ∑ i=1 [∑ ]2 T g (w ) t=1 i t + ) R2 ηT ( 2 2 ≤ + (m + 1)G + 2mD ) . 2(δηT + m/η) 2η 2 The statement of the first part of the theorem follows by using the expression for η. The second part is proved by substituting the regret bound by its lower bound as ∑T t=1 ft (wt ) − ft (w∗ ) ≥ −F T . Remark 9.20. We observe that the introduction of quadratic regularizer δη∥λ∥2 /2 allows [∑ ]2 ∑ T us to turn the expression λi Tt=1 gi (wt ) into g (w ) , leading to the bound for the t=1 i t + violation of the constraints. In addition, the quadratic regularizer defined in terms of λ allows us to work with unbounded λ because it cancels the contribution of the ∥λt ∥ terms from the loss function and the bound on the gradients ∥∇w Lt (w, λ)∥. Note that the constraint for δ mentioned in Theorem 9.19 is equivalent to 2 √ ≤δ≤ 1/(m + 1) + (m + 1)−2 − 8G2 η 2 1/(m + 1) + √ (m + 1)−2 − 8G2 η 2 , 4η 2 (9.33) from which, when T is large enough (i.e., η is small enough), we can simply set δ = 2(m + 1)G2 that will obey the constraint in (9.33). By investigating Lemma 9.17, it turns out that the boundedness of the gradients is essential to obtain bounds for Algorithm 14 in Theorem 9.19. Although, at each iteration, λt is projected onto the Rm + , since W is a compact set and functions ft (w) and gi (w), i ∈ [m] are convex, the boundedness of the functions implies that the gradients are bounded [22, Proposition 4.2.3]. 272 9.5.3 An Efficient Algorithm with O(T 3/4 ) Regret Bound and without Violation of Constraints In this subsection we generalize Algorithm 14 such that the constrained are satisfied in a long run. To create a sequence of solutions {wt , t ∈ [T ]} that satisfies the long term constraints ∑T t=1 gi (wt ) ≤ 0, i ∈ [m], we make two modifications to Algorithm 14. First, instead of handling all of the m constraints, we consider a single constraint defined as g(w) = maxi∈[m] gi (w). Apparently, by achieving zero violation for the constraint g(w) ≤ 0, it is guaranteed that all of the constraints gi (·), i ∈ [m] are also satisfied in the long term. Furthermore, we change Algorithm 14 by modifying the definition of Lt (·, ·) as Lt (w, λ) = ft (w) + λ(g(w) + γ) − ηδ 2 λ , 2 (9.34) where γ > 0 will be decided later. This modification is equivalent to considering the constraint g(w) ≤ −γ, a tighter constraint than g(w) ≤ 0. 
The main idea behind this modification is that by using a tighter constraint in our algorithm, the resulting sequence of solutions will satisfy the long term constraint ∑T t=1 g(wt ) ≤ 0, even though the tighter constraint is violated in many trials. Before proceeding, we state a fact about the Lipschitz continuity of the function g(w) in the following proposition. Proposition 9.21. Assume that functions gi (·), i ∈ [m] are Lipschitz continuous with constant G. Then, function g(w) = maxi∈[m] gi (w) is Lipschitz continuous with constant G, that is, |g(w) − g(w′ )| ≤ G∥w − w′ ∥ for any w ∈ B and w′ ∈ B. 273 Proof. See Appendix ??. To obtain a zero bound on the violation of constraints in the long run, we make the following assumption about the constraint function g(w). Assumption 9.22. Let W ′ ⊆ W be the convex set defined as W ′ = {w ∈ Rd : g(w)+γ ≤ 0} where γ ≥ 0. We assume that the norm of the gradient of the constraint function g(w) is lower bounded at the boundary of W ′ , that is, A5 min g(w)+γ=0 ∥∇g(w)∥ ≥ σ. A direct consequence of assumption A5 is that by reducing the domain W to W ′ , the optimal value of the constrained optimization problem minw∈W f (w) does not change much, as revealed by the following theorem. Theorem 9.23. Let w∗ and wγ be the optimal solutions to the constrained optimization problems defined as ming(w)≤0 f (w) and ming(w)≤−γ f (w), respectively, where f (w) = ∑T t=1 ft (w) and γ ≥ 0. We have |f (w∗ ) − f (wγ )| ≤ G γT. σ Proof. We note that the optimization problem ming(w)≤−γ f (w) = ming(w)≤−γ ∑T t=1 ft (w), can also be written in the minimax form as f (wγ ) = min max w∈B λ∈R+ T ∑ ft (w) + λ(g(w) + γ), t=1 274 (9.35) where we use the fact that W ′ ⊆ W ⊆ B. We denote by wγ and λγ the optimal solutions to (9.35). We have f (wγ ) = min max T ∑ w∈B λ∈R+ = min w∈B ≤ T ∑ T ∑ ft (w) + λ(g(w) + γ) t=1 ft (w) + λγ (g(w) + γ) t=1 ft (w∗ ) + λγ (g(w∗ ) + γ) ≤ t=1 T ∑ ft (w∗ ) + λγ γ, t=1 where the second equality follows the definition of the wγ and the last inequality is due to the optimality of w∗ , that is, g(w∗ ) ≤ 0. To bound |f (wγ ) − f (w∗ )|, we need to bound λγ . Since wγ is the minimizer of (9.35), from the optimality condition we have − T ∑ ∇ft (wγ ) = λγ ∇g(wγ ). (9.36) t=1 By setting v = − ∑T t=1 ∇ft (wγ ), we can simplify (9.36) as λγ ∇g(wγ ) = v. From the KKT optimality condition [34], if g(wγ ) + γ < 0 then we have λγ = 0; otherwise according to Assumption 9.22 we can bound λγ by λγ ≤ ∥v∥ GT ≤ . ∥∇g(wγ )∥ σ We complete the proof by applying the fact f (w∗ ) ≤ f (wγ ) ≤ f (w∗ ) + λγ γ. As indicated by Theorem 9.23, when γ is small, we expect the difference between two 275 optimal values f (w∗ ) and f (wγ ) to be small. Using the result from Theorem 9.23, in the following theorem, we show that by running Algorithm 14 on the modified convex-concave functions defined in (9.34), we are able to obtain an O(T 3/4 ) regret bound and zero bound on the violation of constraints in the long run. Theorem 9.24. Set a = 2R/ √ √ 2G2 + 3(D2 + b2 ), η = R2 /[a T ], and δ = 4G2 . Let wt , t ∈ [T ] be the sequence of solutions obtained by Algorithm 14 with functions defined in √ (9.34) with γ = bT −1/4 and b = 2 F (δR2 a−1 + aR−2 ). Let w∗ be the optimal solution to √ ∑ minw∈W Tt=1 ft (w). With sufficiently large T , that is, F T ≥ a T , and under Assump∑T tion 9.22, we have wt , t ∈ [T ] satisfy the global constraint t=1 g(wt ) ≤ 0 and the regret is bounded by RegretT = T ∑ t=1 √ b ft (wt ) − ft (w∗ ) ≤ a T + GT 3/4 = O(T 3/4 ). σ Proof. 
Let wγ be the optimal solution to ming(w)≤−γ ∑T t=1 ft (w). Similar to the proof of Theorem 9.19 when applied to functions in (9.34) we have T ∑ ft (wt ) − t=1 ≤ − T ∑ t=1 ft (w) + λ T ∑  (g(wt ) + γ) −  t=1 T ∑  λt  (g(w) + γ) − t=1 δηT 2 λ 2 T T ) η( )∑ δη ∑ 2 R2 + λ2 ηT ( 2 λt + + 2G + 3(D2 + γ 2 ) + 2G2 + 3δ 2 η 2 λ2t . 2 2η 2 2 t=1 t=1 By setting δ ≥ 2G2 + 3δ 2 η 2 which is satisfied by δ = 4G2 , we cancel the terms including λt from the right hand side of above inequality. By maximizing for λ over the range (0, +∞) and noting that γ ≤ b, for the optimal solution wγ , we have 276 T ∑ [ ] ft (wt ) − ft (wγ ) + [∑ ]2 T t=1 g(wt ) + γT + 2(δηT + 1/η) t=1 ≤ ) R2 ηT ( 2 + 2G + 3(D2 + b2 ) , 2η 2 which, by optimizing for η and applying the lower bound for the regret as ∑T t=1 ft (wt ) − ft (wγ ) ≥ −F T , yields the following inequalities T ∑ √ ft (wt ) − ft (wγ ) ≤ a T (9.37) t=1 and T ∑ √ g(wt ) ≤ ( √ )√ 2 FT + a T T t=1 ( δR2 a + 2 a R ) − γT, (9.38) for the regret and the violation of the constraint, respectively. Combining (9.37) with the √ ∑T ∑T result of Theorem 9.23 results in f (w ) ≤ f (w ) + a T + (G/σ)γT . By γ ∗ t t t=1 t=1 choosing γ = bT −1/4 we attain the desired regret bound as T ∑ t=1 √ bG 3/4 ft (wt ) − ft (w∗ ) ≤ a T + T = O(T 3/4 ). σ To obtain the bound on the violation of constraints, we note that in (9.38), when T is √ √ ∑ sufficiently large, that is, F T ≥ a T , we have Tt=1 g(wt ) ≤ 2 F (δR2 a−1 + aR−2 )T 3/4 − √ bT 3/4 . Choosing b = 2 F (δR2 a−1 + aR−2 )T 3/4 guarantees the zero bound on the violation of constraints as claimed. 277 9.6 Proofs of Convergence Rates In this section we provide the proof of the main lemmas omitted from the analysis of convergence rate. 9.6.1 Proof of Lemma 9.10 Following the standard analysis of gradient descent methods, we have for any w ∈ B, ′ ∥wt+1 − w∥22 − ∥wt − w∥22 ≤ ∥wt+1 − w∥22 − ∥wt − w∥22 = ∥wt − ηt (g(wt , ξt ) + λt ∇g(wt )) − w∥22 − ∥wt − w∥22 ≤ ηt2 ∥g(wt , ξt ) + λt ∇g(wt )∥22 − 2ηt (wt − w)⊤ (g(wt , ξt ) + λt ∇g(wt )) ≤ ηt2 ∥g(wt , ξt ) + λt ∇g(wt )∥22 − 2ηt (wt − w)⊤ (∇f (wt ) + λt ∇g(wt )) +2ηt (w − wt )⊤ (g(wt , ξt ) − ∇f (wt )), ≡∇w L(wt ,λt ) ≡ζt (w) Then we have (wt − w)⊤ ∇w L(wt , λt ) ) η 1 ( t ≤ ∥wt − w∥22 − ∥wt+1 − w∥22 + ∥g(wt , ξt ) + λt ∇g(wt )∥22 + ζt (w) 2ηt 2 ( ) 1 ≤ ∥wt − w∥22 − ∥wt+1 − w∥22 + ηt ∥g(wt , ξt )∥22 + ηt λ2t ∥∇g(wt )∥22 + ζt (w) 2ηt ) 1 ( 2 2 ≤ ∥wt − w∥2 − ∥wt+1 − w∥2 2ηt + 2ηt ∥g(wt , ξt ) − ∇f (wt )∥22 +2ηt ∥∇f (wt )∥22 + ηt λ2t ∥∇g(wt )∥22 + ζt (w) ≡∆t 278 By using the bound on ∥∇f (wt )∥2 and ∥∇g(wt )∥2 , we obtain the first inequality in Lemma 9.10. To prove the second inequality, we follow the same analysis, i.e., |λt+1 − λ|2 − |λt − λ|2 ≤ |λt + ηt (g(wt ) − γλt )|2 − |λt − λ|2 ≤ ηt2 |g(wt ) − γλt |2 + 2ηt (λt − λ) (g(wt ) − γλt ) . ≡∇λ L(wt ,λt ) Then we have (λ − λt )∇λ L(wt , λt ) ≤ ) η 1 ( t |λt − λ|2 − |λt+1 − λ|2 + |g(wt ) − γλt |2 . 2ηt 2 By induction, it is straightforward to show that λt ≤ C2 /γ, which yields the second inequality in Lemma 9.10, i.e., (λ − λt )∇λ L(wt , λt ) ≤ 9.6.2 ) 1 ( |λt − λ|2 − |λt+1 − λ|2 + 2ηt C22 . 2ηt Proof of Lemma 9.11 Since Lt (w, λ) is convex in w and concave in λ, we have the following inequalities L(w, λt ) − L(wt , λt ) ≥ (w − wt )⊤ ∇w L(wt , λt ), L(wt , λ) − L(wt , λt ) ≤ (λ − λt )∇λ L(wt , λt ). 
279 Using the inequalities in Lemma 9.10, we have 1 2ηt 1 L(wt , λ) − L(wt , λt ) ≤ 2ηt L(wt , λt ) − L(w, λt ) ≤ ( ( ) ∥w − wt ∥22 − ∥w − wt+1 ∥22 + 2ηt G21 + ηt G22 λ2t + 2ηt ∆t + ζt (w), |λ − λt |2 − |λ − λt+1 |2 ) + 2ηt C22 , where ζt (w) = ⟨w − wt , g(wt , ξt ) − ∇f (wt )⟩ as abbreviated before. Since η1 = · · · = ηT , denoted by η, by taking summation of above two inequalities over t = 1, · · · , T , we get T ∑ t=1 ∑ ∑ ∑ ∥w∥22 λ2 L(wt , λ) − L(w, λt ) ≤ + + 2ηT (G21 + C22 ) + ηG22 λ2t + 2η ∆t + ζt (w). 2η 2η t T T t=1 t=1 By plugging the expression of L(w, λ), and due to ∥w∥2 ≤ 1, we have T ∑ t=1 ≤ (f (wt ) − f (w)) + λ T ∑ ( g(wt ) − t=1 γT 1 + 2 2η ) λ2 ∑ ∑ ∑ ∑ 1 + 2ηT (G21 + C22 ) + (ηG22 − γ/2)λ2t + λt g(w) + 2η ∆t + ζt (w). 2η t t T T t=1 t=1 Let w = w∗ = arg minw∈W f (w). By taking minimization over λ ≥ 0 on left hand side and considering η = γ/(2G22 ), we have ∑ T T T ∑ ∑ [ Tt=1 g(wt )]2+ G22 (G21 + C22 ) γ ∑ (f (wt ) − f (w∗ )) + ≤ + γT + 2 ∆t + ζt (w∗ ) 2 /γ) 2 γ 2(γT + 2G G G 2 2 2 t=1 t=1 t=1 280 9.6.3 Proof of Lemma 9.13 Since F (w) is strongly convex in w, we have F (w) − F (wt ) ≥ ⟨w − wt , ∇F (wt )⟩ + β ∥w − wt ∥22 . 2 Following the same analysis as in Lemma 9.10, we have 1 (wt − w)⊤ ∇F (wt ) ≤ 2ηt ( ∥w − wt ∥22 − ∥w − wt+1 ∥22 ) + ηt ∥g(wt , ξt ) + p(wt )λ0 ∇g(wt )∥22 2 β + ζt (w) − ∥w − wt ∥22 2 ) ( 1 ∥w − wt ∥22 − ∥w − wt+1 ∥22 ≤ 2ηt β + ηt G21 + ηt λ20 G22 + ζt (w) − ∥w − wt ∥22 , 2 where p(w) = exp (λ0 g(w)/γ) . 1 + exp (λ0 g(w)/γ) Taking summation of above inequality over t = 1, · · · , T gives T ∑ F (wt ) − F (w) ≤ t=1 + ( T ∑ 1 1 t=1 T ∑ 2 1 β − − ηt ηt−1 2 ηt (G21 + λ20 G22 ) + t=1 ) T ∑ t=1 Since ηt = 1/(2βt), we have 281 ∥w − wt ∥22 ζt (w) − T β∑ ∥w − wt ∥22 . 4 t=1 T T T ∑ (G21 + λ20 G22 )(1 + ln T ) ∑ β∑ (F (wt ) − F (w)) ≤ + ζt (w) − ∥w − wt ∥22 2β 4 t=1 t=1 t=1 We complete the proof by letting w = w∗ = arg minw∈W f (w). 9.6.4 Proof of Lemma 9.14 The proof is based on the Berstein’s inequality for martingales (see Theorem A.25). To do so, define the martingale difference Xt = ⟨w − wt , ∇f (wt ) − g(wt , ξt )⟩ and martingale ΛT = ∑T 2 t=1 Xt . Define the conditional variance ΣT as Σ2T = T ∑ t=1 T [ ] ∑ ∥wt − w∥22 = 4G21 DT . Eξt Xt2 ≤ 4G21 t=1 Define K = 4G1 . We have ( ) √ √ 2 Pr ΛT ≥ 2 4G1 DT τ + 2Kτ /3 ( ) √ √ 2 2 2 = Pr ΛT ≥ 2 4G1 DT τ + 2Kτ /3, ΣT ≤ 4G1 DT ( ) √ √ 4 2 2 2 = Pr ΛT ≥ 2 4G1 DT τ + 2Kτ /3, ΣT ≤ 4G1 DT , DT ≤ T ) ( m √ ∑ √ 4 4 i−1 i 2 2 < DT ≤ 2 + Pr ΛT ≥ 2 4G21 DT τ + 2Kτ /3, ΣT ≤ 4G1 DT , 2 T T i=1 ) ( √ ( ) ∑ m √ 4 4 4 ≤ Pr DT ≤ + Pr ΛT ≥ 2 × 4G21 2i τ + 2Kτ /3, Σ2T ≤ 4G21 2i T T T i=1 ( ) 4 ≤ Pr DT ≤ + me−τ . T 282 where we use the fact ∥wt − w∥22 ≤ 4 for any w ∈ B, and the last step follows the Bernstein inequality for martingales. We complete the proof by setting τ = ln(m/δ). 9.7 Summary In this chapter, we made a progress towards making the SGD method efficient by proposing a framework in which it is possible to exclude the projection steps from the SGD algorithm. We have proposed two novel algorithms to overcome the computational bottleneck of the projection step in applying SGD to optimization problems with complex domains. We showed using novel theoretical analysis that the proposed algorithms can achieve an √ O(1/ T ) convergence rate for general convex functions and an O(ln T /T ) rate for strongly convex functions with a overwhelming probability which are known to be optimal (up to a logarithmic factor) for stochastic optimization. 
We have also addressed the problem of online convex optimization with constraints, where the constraints only need to be satisfied in the long run. In addition to the regret bound, which is the main tool for analyzing the performance of general online convex optimization algorithms, we defined a bound on the long-term violation of the constraints, which measures the cumulative violation of the constraints by the solutions over all rounds. Our setting applies to online convex optimization without projecting the solutions onto the convex domain at each iteration, a step that may be computationally expensive for complex domains. Our strategy is to turn the problem into an online convex-concave optimization problem and to apply an online gradient descent algorithm to solve it.

9.8 Bibliographic Notes

Generally, the computational complexity of the projection step in SGD has seldom been taken into account in the literature. Here, we briefly review previous work on projection-free convex optimization, which is closely related to the theme of this study. For some specific domains, efficient algorithms have been developed to circumvent the high computational cost caused by the projection step at each iteration of gradient descent methods. The main idea is to select an appropriate direction to take from the current solution such that the next solution is guaranteed to stay within the domain. Clarkson [42] proposed a sparse greedy approximation algorithm for convex optimization over a simplex domain, which is a generalization of an old algorithm by Frank and Wolfe [56] (a.k.a. conditional gradient descent [21]). Zhang [155] introduced a similar sequential greedy approximation algorithm for certain convex optimization problems over a domain given by a convex hull. Hazan [67] devised an algorithm for approximately maximizing a concave function over a trace-norm-bounded PSD cone, which only needs to compute the maximum eigenvalue and the corresponding eigenvector of a symmetric matrix. Ying et al. [151] formulated distance metric learning problems as eigenvalue maximization and proposed an algorithm similar to [67]. Recently, Jaggi [77] put these ideas into a general framework for convex optimization over a general convex domain. Instead of projecting the intermediate solution onto a complex convex domain, Jaggi's algorithm solves a linearized problem over the same domain. He showed that Clarkson's algorithm, Zhang's algorithm, and Hazan's algorithm discussed above are special cases of his general algorithm for particular domains. It is important to note that all these algorithms are designed for batch optimization, not for stochastic optimization, which is the focus of this chapter. The proposed stochastic optimization methods with only one projection are closely related to the online Frank-Wolfe (OFW) algorithm proposed in [73]. It is a projection-free online learning algorithm, built on the assumption that it is possible to efficiently minimize a linear function over the complex domain. Indeed, the OFW algorithm replaces the projection onto the convex domain with a linear program over the same domain, which makes sense if solving the linear program is cheaper than the projection. One main shortcoming of the OFW algorithm is that its convergence rate for general stochastic optimization is O(T^{-1/3}), significantly slower than that of a standard stochastic gradient descent algorithm (i.e., O(T^{-1/2})).
It achieves a convergence rate of O(T^{-1/2}) only when the objective function is smooth, which unfortunately does not hold for many machine learning problems, where either a non-smooth regularizer or a non-smooth loss function is used. Another limitation of OFW is that it assumes a linear optimization problem over the domain W can be solved efficiently. Although this assumption holds for some specific domains as discussed in [73], in many settings of practical interest it may not be true. The proposed algorithms address these two limitations explicitly. In particular, we show how two seemingly different modifications of SGD can be used to avoid performing expensive projections while retaining convergence rates similar to the original SGD method.

The proposed online optimization with soft constraints setup is reminiscent of regret minimization with side constraints, or constrained regret minimization, addressed in [110] and motivated by applications in wireless communication. In regret minimization with side constraints, beyond minimizing regret, the learner has some side constraints that need to be satisfied on average over all rounds. Unlike our setting, in learning with side constraints the set W is controlled by the adversary and can vary arbitrarily from trial to trial. It has been shown that if the convex set is affected by both decisions and loss functions, the minimax optimal regret is generally unattainable online [111].

APPENDIX

Appendix A

Technical Background

A.1 Convex Analysis

In this appendix, we introduce the basic definitions and results of convex analysis [121, 34, 28] needed for the analysis of the algorithms in this thesis. We begin by formalizing a few topological definitions used throughout the thesis, followed by a few definitions and results from convexity.

Definition A.1 (Euclidean Ball). A Euclidean ball with radius r centered at point w0 is B(w0, r) = {w ∈ R^d : ∥w − w0∥ ≤ r}. An ℓp ball is defined analogously, with the distance measured by the ℓp norm ∥w∥_p = (∑_{i=1}^d |w_i|^p)^{1/p}.

Definition A.2 (Boundary and Interior). For a given set W, we say that w is on the boundary of W, denoted by ∂W, if for every ϵ > 0 the ball centered at w with radius ϵ covers both the inside and the outside of the set, i.e., B(w, ϵ) ∩ W ≠ ∅ and B(w, ϵ) ∩ W̄ ≠ ∅, where W̄ is the complement of the set W. We say that a point w is in the interior of the set W if ∃ϵ > 0 s.t. B(w, ϵ) ⊂ W.

Definition A.3 (Convex Set). A set W in a vector space is convex if for any two vectors w, w′ ∈ W, the line segment connecting the two points is contained in W as well. In other words, for any λ ∈ [0, 1], we have that λw + (1 − λ)w′ ∈ W. Intuitively, a set is convex if its surface has no "dips". The intersection of an arbitrary family of convex sets is obviously convex.

Definition A.4 (Projection onto Convex Sets). Let W ⊆ R^d be a closed convex set and let w ∈ R^d be a point. The projection of w onto W is denoted by ΠW(w) and defined by ΠW(w) = arg min_{w′∈W} ∥w − w′∥. In particular, for every w ∈ R^d there exists a point w∗ which attains this minimum, and it is unique. Here are a few examples of projections. For the positive semidefinite (PSD) cone, i.e., S^d_+ = {X ∈ R^{d×d} : X ⪰ 0}, the projection requires a full eigendecomposition of the input matrix. Precisely, let the symmetric matrix W have the eigendecomposition W = U^⊤ Σ U, where Σ = diag(λ1, λ2, · · · , λd) and U is an orthogonal matrix. Then Π_{S^d_+}(W) = U^⊤ Σ_+ U, where Σ_+ = diag([λ1]_+, [λ2]_+, · · · , [λd]_+).
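Before continuing with the remaining examples, here is a small illustrative sketch of this PSD-cone projection: eigendecompose the symmetric input and clip the negative eigenvalues at zero. The helper name project_psd is hypothetical and only meant to mirror the formula above (NumPy's eigh uses the column-eigenvector convention, so the code forms U diag([λ]_+) U^⊤, which is the same projection).

```python
import numpy as np

def project_psd(W):
    """Euclidean projection of a symmetric matrix W onto the PSD cone S^d_+."""
    W_sym = (W + W.T) / 2.0               # symmetrize to guard against round-off
    eigvals, U = np.linalg.eigh(W_sym)    # W_sym = U diag(eigvals) U^T
    return (U * np.maximum(eigvals, 0.0)) @ U.T

# For example, a matrix with eigenvalues (2, -1) is mapped to its PSD part:
W = np.array([[2.0, 0.0], [0.0, -1.0]])
print(project_psd(W))                     # [[2. 0.], [0. 0.]]
```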
For a hyperplane H = {x ∈ R^d : ⟨a, x⟩ = b}, a ≠ 0, the projection is given by ΠH(w) = w − ((⟨a, w⟩ − b)/∥a∥^2) a. For an affine set A = {x ∈ R^d : Ax = b}, the projection can be obtained by ΠA(w) = w − A^⊤ (AA^⊤)^{-1} (Aw − b). For a polyhedron P = {x ∈ R^d : Ax ≤ b}, there is no analytical solution, and the projection ΠP(w) is an offline optimization problem by itself: the corresponding problem arg min_{w′∈P} ∥w′ − w∥ must be solved approximately.

Lemma A.5 (Non-expansiveness of Projection). Let W ⊆ R^d be a convex set and consider the projection operation defined as above. Then, ∥ΠW(w) − ΠW(w′)∥ ≤ ∥w − w′∥, ∀w, w′ ∈ R^d.

Definition A.6 (Convex Function). A function f : W → R is said to be convex if W is convex and for every w, w′ ∈ W and λ ∈ [0, 1], f(λw + (1 − λ)w′) ≤ λ f(w) + (1 − λ) f(w′). A continuously differentiable function is convex if f(w) ≥ f(w′) + ⟨∇f(w′), w − w′⟩ for all w, w′ ∈ W. If f is non-smooth, then this inequality holds for any subgradient g ∈ ∂f(w′).

Definition A.7 (Subgradient). A subgradient of a convex function f : R^d → R at some point w is any vector g ∈ R^d that achieves the same lower bound as the tangent line to the function f at point w, i.e., f(w′) ≥ f(w) + ⟨g, w′ − w⟩, ∀w′ ∈ R^d. The subgradient g always exists for convex functions on the relative interior of their domain. Furthermore, if f is differentiable at w, then there is a unique subgradient g = ∇f(w). Note that subgradients need not exist for non-convex functions.

Definition A.8 (Subdifferential). The subdifferential of a convex function f : R^d → R at some point w is the set of all subgradients of f at w, i.e., ∂f(w) = {g ∈ R^d : ∀w′, f(w′) ≥ f(w) + ⟨g, w′ − w⟩}. An important property of the subdifferential ∂f(w) is that it is a closed and convex set, even for non-convex functions, which is straightforward to verify from the definition. Moreover, the subdifferential is always nonempty for convex functions. This is a consequence of the supporting hyperplane theorem, which states that at any point on the boundary of a convex set there exists at least one supporting hyperplane. Since the epigraph of a convex function is a convex set, we can apply the supporting hyperplane theorem to the set of points (w, f(w)), which are exactly the boundary points of the epigraph. The subdifferential of a differentiable function contains only a single element, which is the gradient of the function at that point, i.e., ∂f(w) = {∇f(w)}.

We are now prepared to introduce the concept of Lipschitz continuity, designed to measure the change of function values versus the change in the independent variable for a general function f(·).

Definition A.9 (Lipschitzness). A function f : W → R is ρ-Lipschitz over the set W if for every w, w′ ∈ W we have that |f(w) − f(w′)| ≤ ρ∥w − w′∥.

The following lemma gives a result on the Lipschitz continuity of a function defined as the maximum of other Lipschitz continuous functions.

Lemma A.10. Assume that the functions fi : W → R, i ∈ [m], are Lipschitz continuous with constant ρ. Then the function f(w) = max_{i∈[m]} fi(w) is Lipschitz continuous with constant ρ, that is, |f(w) − f(w′)| ≤ ρ∥w − w′∥ for any w, w′ ∈ W.

Proof. First, we rewrite f(w) = max_{i∈[m]} fi(w) as f(w) = max_{α∈∆m} ∑_{i=1}^m αi fi(w), where ∆m is the m-simplex, that is, ∆m = {α ∈ R^m_+ : ∑_{i=1}^m αi = 1}.
Then, we have |f (w) − f (w′ )| = ≤ ≤ max m ∑ α∈∆m i=1 m ∑ max α∈∆m i=1 m ∑ max α∈∆m i=1 αi fi (w) − max αi fi (w) − m ∑ α∈∆m i=1 m ∑ αi fi (w′ ) αi fi (w′ ) i=1 αi fi (w) − fi (w′ ) ≤ ρ∥w − w′ ∥, 291 where the last inequality follows from the Lipschitz continuity of fi (w), i ∈ [m]. Definition A.11 (Smoothness). A differentiable function f : W → R is said to be β-smooth with respect to a norm ∥ · ∥, if it holds that f (w) ≤ f (w′ ) + ⟨∇f (w′ ), w − w′ ⟩ + β ∥w − w′ ∥2 , ∀ w, w′ ∈ W. 2 (A.1) The following result is an important property of smooth functions which has been utilized in the proof of few results in the thesis. Lemma A.12 (Self-bounding Property). For any β-smooth non-negative function f : R → √ R, we have |f ′ (w)| ≤ 4βf (w) As a simple proof, first from the smoothness assumption, by setting w1 = w2 − β1 f ′ (w2 ) in 1 |f ′ (w )|2 . On the other hand, (A.1) and rearranging the terms we obtain f (w2 )−f (w1 ) ≥ 2β 2 from the convexity of loss function we have f (w1 ) ≥ f ′ (w2 ) + ⟨f ′ (w1 ), w1 − w2 ⟩. Combining these inequalities and considering the fact that the function is non-negative gives the desired inequality. Definition A.13 (Strong Convexity). A function f (w) is said to be α-strongly convex w.r.t a norm ∥ · ∥, if there exists a constant α > 0 (often called the modulus of strong convexity) such that, for any λ ∈ [0, 1] and for all w, w′ ∈ W, it holds that 1 f (λw + (1 − λ)w′ ) ≤ αf (w) + (1 − λ)f (w′ ) − λ(1 − λ)α∥w − w′ ∥2 . 2 If f (w) is twice differentiable, then an equivalent definition of strong convexity is ∇2 f (w) ⪰ αI which indicates that the smallest eigenvalue of the Hessian of f (w) is uniformly lower 292 bounded by α everywhere. When f (w) is differentiable, the strong convexity is equivalent to f (w) ≥ f (w′ ) + ⟨∇f (w′ ), w − w′ ⟩ + α ∥w − w′ ∥2 , ∀ w, w′ ∈ W. 2 An important property of strongly convex functions that we used in the proof of few results in the following: Lemma A.14. Let f (w) be a α-strongly convex function over the domain W, and w∗ = arg minw∈W f (w). Then, for any w ∈ W, we have f (w) − f (w∗ ) ≥ α ∥w − w∗ ∥2 . 2 (A.2) Definition A.15 (Dual Norm). Let ∥ · ∥ be any norm on Rd . Its dual norm denoted by ∥ · ∥∗ is defined by ∥w′ ∥∗ = sup ⟨w′ , w⟩ − ∥w∥ w∈Rd An equivalent definition is ∥w′ ∥∗ = supw∈Rd {⟨w, w′ ⟩|∥w∥ ≤ 1}. If p, q ∈ [1, ∞] satisfy 1/p + 1/q = 1, then the ℓp and ℓq norms are dual to each other. Definition A.16 (Fenchel Conjugate 1 ). The Fenchel conjugate of a function f : W → R is defined as f ∗ (v) = sup ⟨w, v⟩ − f (w). w∈W An immediate result of above definition is the Fenchel-Young inequality stating that for any w and v we have that ⟨w, v⟩ ≤ f (w) + f ∗ (v). Definition A.17 (Bregman Divergence). Let Φ : W → R be a continuously-differentiable 1 Also known as the convex conjugate or Legendre-Fenchel transformation 293 real-valued and strictly convex function defined on a closed convex set W. The Bregman divergence between w and w′ is the difference at w between f and a linear approximation around w′ BΦ (w, w′ ) = Φ(w) − Φ(w′ ) − ⟨w − w′ , ∇Φ(w′ )⟩. For example the squared Euclidean distance B(w, w′ ) = ∥w − w′ ∥2 is the canonical example of a Bregman distance, generated by the convex function Φ(w) = ∥w∥2 . The entropy function Φ(p) = ∑ divergence as BKL (p, q) = A.2 i pi log pi − ∑ ∑ pi gives rises to the generalized Kullback-Leibler ∑ ∑ p pi log qi − pi + qi . i Concentration Inequalities This appendix is a quick excursion into concentration of measure. 
We consider some important concentration results used in the proofs given in this thesis. Concentration inequalities give probability bounds for a random variable to be concentrated around its mean (e.g., see [49] , [30] and [31] for a through discussion and derivation of these inequalities). Lemma A.18 (Markov’s Inequality). For a non-negative random variable X and t > 0, P[X ≥ t] ≤ E[X] t The upside of Markov’s inequality is that it does not need almost no assumptions about the random variable, but the downside is that it only gives very weak bounds. Lemma A.19 (Hoeffding’s Lemma). Let X be any real-valued random variable with expected 294 value E[X] = 0 and such that X ∈ [a, b] almost surely. Then, for all λ ∈ R, [ E eλX ] ) ( 2 λ (b − a)2 ≤ exp . 8 Theorem A.20 (Hoeffding’s Inequality). Let X1 , · · · , Xn be a sequence of i.i.d random variables. Assume Xi ∈ [ai , bi ] and let S = X1 + X2 + · · · + Xn . Then, ( P(S − E[S] ≥ t) ≤ exp − ∑n 2t2 2 i=1 (bi − ai ) ( P(|S − E[S]| ≥ t) ≤ 2 exp − ∑n ) 2t2 2 i=1 (bi − ai ) , ) . The following result extends Hoeffding’s inequality to more general functions f (x1 , x2 , · · · , xn ). Theorem A.21 (McDiarmid’s Inequality). Let X1 , · · · , Xn be independent real-valued random variables. Suppose that sup x1 ,x2 ,··· ,xn ,x′i f (x1 , · · · , xi−1 , xi , xi+1 , · · · , xn ) − f (x1 , · · · , xi−1 , x′i , xi+1 , · · · , xn ) ≤ ci , for i = 1, 2, · · · , n. Then, ( 2t2 ) P (|f (X1 , X2 , · · · , Xn ) − E [f (X1 , X2 , · · · , Xn )]| ≥ t) ≤ 2 exp − ∑n 2 i=1 ci Hoeffding’s inequality does not use any information about the random variables except 295 the fact that they are bounded. If the variance of Xi , i ∈ [n] is small, then we can get a sharper inequality from Bennett’s inequality and its simplified version in Bernstein’s Inequality. Theorem A.22 (Bennett’s Inequality). Let X1 , · · · , Xn be a sequence of i.i.d random variables. Assume E[Xi ] = 0, E[Xi2 ] = σ 2 , and |Xi | ≤ M, i ∈ [n]. Then, P ( n ∑ ) Xi ≥ t ( ≤ exp i=1 −nσ 2 ϕ M2 ( tM nσ 2 )) , where ϕ(x) = (1 + x) log(1 + x) − x. Theorem A.23 (Bernstein’s Inequality). Let X1 , . . . , Xn be independent zero-mean E[Xi ] = 0, i ∈ [n] random variables. Suppose that |Xi | ≤ M almost surely, for all i ∈ [n]. Then, for all positive t, P ( n ∑ i=1 ) Xi > t   t2 ) . ≤ exp − (∑ [ ] 1 2 2 E Xj + 3 M t The next result is an extension of Hoeffding’s inequality to martingales with zero-mean and bounded increments which is due to [10]. Before stating the Hoeffding-Azuma’s inequality, we need few definitions. A sequence of random variables (Xi )i∈N on a probability space (Ω, A, P) is a martingale difference sequence with respect to a filtration (Fi )i∈N if and only if, or all i ≥ 1, each random variable Xi is Fi -measurable and satisfies, E[Xi |Fi−1 ] = 0. Theorem A.24 (Hoeffding-Azuma Inequality). Let X1 , X2 , . . . be a martingale difference sequence with respect to a filtration F = (Fi )i∈N . Assume that for all i ≥ 1, there exists a 296 Fi−1 -measurable random variable Ai and a non-negative constant ci such that Xi ∈ [Ai , Ai + ci ]. Then, the martingale Sn defined by Sn = ∑n i=1 Xi , satisfies for any t > 0, ( 2t2 P[Sn > t] ≤ exp − ∑n 2 i=1 ci ) . The Hoeffding-Azuma’s inequality indicates that Sn is sub-gaussian with variance i.e., ∑n 2 i=1 ci /4, ( ) ] −2t2 P max Si ≥ t ≤ exp ∑n 2 . 1≤i≤n i=1 ci [ The following inequality which is due to Freedman [57] extends Bernstein’s result to the case of discrete-time martingales with bounded jumps (a.k.a Freedman’s inequality). 
This result demonstrates that a martingale exhibits normal-type concentration near its mean value on a scale determined by the predictable quadratic variation of the sequence. Theorem A.25 (Bernstein’s Inequality for Martingales). Let X1 , . . . , Xn be a bounded martingale difference sequence with respect to the filtration F = (Fi )1≤i≤n and with |Xi | ≤ M . Let Si = i ∑ Xj j=1 be the associated martingale. Denote the sum of the conditional variances by Σ2n = n [ ] ∑ E Xi2 |Fi−1 , i=1 Then for all constants t, ν > 0, ] [ P max Si > t and Σ2n ≤ ν i=1,...,n 297 ( ≤ exp − t2 2(ν + M t/3) ) , and therefore, [ P A.3 max Si > i=1,...,n √ ] √ 2 2νt + M t and Σ2n ≤ ν ≤ e−t . 3 Miscellanea A.3.1 Extra-gradient Lemma Lemma A.26. Let Φ(z) be a α-strongly convex function with respect to the norm ∥·∥, whose dual norm is denoted by ∥ · ∥∗ , and B(w, z) = Φ(w) − (Φ(z) + (w − z)⊤ Φ′ (z)) be the Bregman distance induced by function Φ(w). Let Z be a convex compact set, and U ⊆ Z be convex and closed. Let z ∈ Z, γ > 0, Consider the points, w = arg min γu⊤ ξ + B(u, z), (A.3) z+ = arg min γu⊤ ζ + B(u, z), (A.4) u∈U u∈U then for any u ∈ U , we have γζ ⊤ (w − u) ≤ B(u, z) − B(u, z+ ) + γ2 α ∥ξ − ζ∥2∗ − [∥w − z∥2 + ∥w − z+ ∥2 ]. α 2 Proof. By using the definition of Bregman distance B(u, z), we can write two updates in (A.3) as 298 w = arg min u⊤ (γξ − Φ′ (z)) + Φ(u), u∈U (A.5) z+ = arg min u⊤ (γζ − Φ′ (z)) + Φ(u), u∈U by the first oder optimality condition, we have (u − w)⊤ (γξ − Φ′ (z) + Φ′ (w)) ≥ 0, ∀u ∈ U, (A.6) (u − z+ )⊤ (γζ − Φ′ (z) + Φ′ (z+ )) ≥ 0, ∀u ∈ U Applying (A.6) with u = z+ and u = w, we get γ(w − z+ )⊤ ξ ≤ (Φ′ (z) − Φ′ (w))⊤ (w − z+ ), (A.7) γ(z+ − w)⊤ ζ ≤ (Φ′ (z) − Φ′ (z+ ))⊤ (z+ − w). Summing up the two inequalities, we have γ(w − z+ )⊤ (ξ − ζ) ≤ (Φ′ (z+ ) − Φ′ (w))⊤ (w − z+ ). (A.8) Then γ∥ξ − ζ∥∗ ∥w − z+ ∥ ≥ −γ(w − z+ )⊤ (ξ − ζ) ≥ (Φ′ (z+ ) − Φ′ (w))⊤ (z+ − w) ≥ α∥z+ − w∥2 . where in the last inequality, we use the strong convexity of Φ(w). 299 (A.9) B(u, z) − B(u, z+ ) = Φ(z+ ) − Φ(z) + (u − z+ )⊤ Φ′ (z+ ) − (u − z)⊤ Φ′ (z) =Φ(z+ ) − Φ(z) + (u − z+ )⊤ Φ′ (z+ ) − (u − z+ )⊤ Φ′ (z) − (z+ − z)⊤ Φ′ (z) =Φ(z+ ) − Φ(z) − (z+ − z)⊤ Φ′ (z) + (u − z+ )⊤ (Φ′ (z+ ) − Φ′ (z)) =Φ(z+ ) − Φ(z) − (z+ − z)⊤ Φ′ (z) + (u − z+ )⊤ (γζ + Φ′ (z+ ) − Φ′ (z)) − (u − z+ )⊤ γζ ≥Φ(z+ ) − Φ(z) − (z+ − z)⊤ Φ′ (z) − (u − z+ )⊤ γζ = Φ(z+ ) − Φ(z) − (z+ − z)⊤ Φ′ (z) − (w − z+ )⊤ γζ +(w − u)⊤ γζ, ϵ (A.10) where the inequality follows from (A.6). We proceed by bounding ϵ as: ϵ =Φ(z+ ) − Φ(z) − (z+ − z)⊤ Φ′ (z) − (w − z+ )⊤ γζ =Φ(z+ ) − Φ(z) − (z+ − z)⊤ Φ′ (z) − (w − z+ )⊤ γ(ζ − ξ) − (w − z+ )⊤ γξ =Φ(z+ ) − Φ(z) − (z+ − z)⊤ Φ′ (z) − (w − z+ )⊤ γ(ζ − ξ) + (z+ − w)⊤ (γξ − Φ′ (z) + Φ′ (w)) − (z+ − w)⊤ (Φ′ (w) − Φ′ (z)) ≥Φ(z+ ) − Φ(z) − (z+ − z)⊤ Φ′ (z) − (w − z+ )⊤ γ(ζ − ξ) − (z+ − w)⊤ (Φ′ (w) − Φ′ (z)) =Φ(z+ ) − Φ(z) − (w − z)⊤ Φ′ (z) − (w − z+ )⊤ γ(ζ − ξ) − (z+ − w)⊤ Φ′ (w) [ ] [ ] ⊤ ′ ⊤ ′ = Φ(z+ ) − Φ(w) − (z+ − w) Φ (w) + Φ(w) − Φ(z) − (w − z) Φ (z) −(w − z+ )⊤ γ(ζ − ξ) α α ≥ ∥w − z+ ∥2 + ∥w − z∥2 − γ∥w − z+ ∥∥ζ − ξ∥∗ 2 2 α γ2 ≥ {∥w − z+ ∥2 + ∥w − z∥2 } − ∥ζ − ξ∥2∗ , 2 α 300 (A.11) where the first inequality follows from (A.6), the second inequality follows from the strong convexity of Φ(w), and the last inequality follows from (A.10). Combining the above results, we have γ(w − u)⊤ ζ ≤ B(u, z) − B(u, z+ ) + A.3.2 γ2 α ∥ζ − ξ∥2∗ − {∥w − z+ ∥2 + ∥w − z∥2 }. α 2 (A.12) Stochastic Mirror Descent for Smooth Losses In this subsection we provide a tight analysis of stochastic mirror decent algorithm for smooth losses. 
We will show that the convergence rate of the mirror descent algorithm [119] for stochastic optimization of both non-smooth and smooth functions is dominated by an √ O(1/ T ) convergence rate. We consider the nonlinear projected subgradient formulation of the mirror descent from [17] which iteratively updates the solution by: { } 1 wt+1 = arg min ⟨w, gt ⟩ + BΦ (w, wt ) , ηt w∈W (A.13) where gt is a (sub)gradient of f (w) at point wt and BΦ (·, ·) is the Bregmen diveregnce defined with strongly convex function Φ(·). Define the average solution as wT = ∑T t=1 wt /T and let g(w, ξ) denote a stochastic gradient of function at point w, i.e., E[g(w, ξ)] = ∇f (w). We state few standard assumptions we make in our analysis about the stochastic optimization problem. A1. (Lipschitz Continuity) E[∥g(w, ξ)∥2∗ ] ≤ G2 301 A2. (Bounded Variance) E[∥∇f (w) − g(w, ξ)∥] ≤ σ 2 . A3. (Smoothness) The function f has L Lipschitz continuous gradient. A4. (Compactness) BΦ (w∗ , w) ≤ R2 . The proof is extensively uses the following key inequalities: (1) Optimality condition ⟨w − wt+1 , gt + 1 1 ∇Φ(wt+1 ) − ∇Φ(wt )⟩ ≥ 0 ηt ηt (2) Generalized inequality BΦ (w∗ , wt ) − BΦ (w∗ , wt+1 ) − BΦ (wt+1 , wt ) = ⟨w∗ − wt+1 , ∇Φ(wt+1 ) − ∇Φ(wt )⟩ (3) Fenchel-Yang inequality applied to the conjugate pair 12 ∥ · ∥2 and 12 ∥ · ∥2∗ yields ⟨gt , wt − wt+1 ⟩ ≤ 1 ηt ∥gt ∥2∗ + ∥wt − wt+1 ∥2 2 2ηt (4) 1 BΦ (w, y) ≥ ∥w − y∥2 2 (5) From smoothness assumption we have f (w′ ) − f (w) − ⟨∇f (w), w′ − w⟩ ≤ L ′ ∥w − w∥2 2 The analysis of the convergence rate of basic mirror descent algorithm follows standard techniques in convex optimization and for completeness we provide a proof here following [17]. 302 Lemma A.27. We have f (wt ) − f (w∗ ) ≤ 1 ηt ||gt ||2∗ + [BΦ (w∗ , wt ) − BΦ (w∗ , wt+1 )] 2 ηt Proof. From the convexity of f (w) and by setting ζt = ∇f (wt ) − gt we have f (wt ) − f (w∗ ) ≤ ⟨∇f (wt ), wt − w∗ ⟩ ≤ ⟨gt , wt − w∗ ⟩ + ⟨∇f (wt ) − gt , wt − w∗ ⟩ (1) ≤ ⟨gt , wt − w∗ ⟩ + ⟨w∗ − wt+1 , gt + = ⟨gt , wt − wt+1 ⟩ + (2) η t ≤ 2 ∥gt ∥2∗ + 1 1 ∇Φ(wt+1 ) − ∇Φ(wt )⟩ + ⟨ζt , wt − w∗ ⟩ ηt ηt 1 ⟨w∗ − wt+1 , ∇Φ(wt+1 ) − ∇Φ(wt )⟩ + ⟨ζt , wt − w∗ ⟩ ηt 1 1 ∥wt − wt+1 ∥2 + [BΦ (w∗ , wt ) − BΦ (w∗ , wt+1 ) 2ηt ηt − BΦ (wt+1 , wt )] + ⟨ζt , wt − w∗ ⟩ (3) η t ≤ 2 ∥gt ∥2∗ + 1 [B (w∗ , wt ) − BΦ (w∗ , wt+1 )] + ⟨ζt , wt − w∗ ⟩ ηt Φ Theorem A.28. By setting leaning rate as ηt = R √ we have the following rate for the G t convergence of stochastic mirror descent for non-smooth objectives: RG E[f (wT )] − f (w∗ ) ≤ O( √ ) T 303 Proof. Summing up Lemma 1 for all iterations we get T ∑ t=1 ∑ 1 1 1 f (wt ) − f (w∗ ) ≤ BΦ (w∗ , w1 ) + BΦ (w∗ , wt )( − ) η1 ηt ηt−1 T t=2 + T ∑ ηt t=1 ≤ 2 ∥gt ∥2∗ + T ∑ ⟨ζt , wt − w∗ ⟩ t=1 T R 2 G2 ∑ + ηt ηT 2 t=1 √ = O(RG T ), where the second inequality follows since each wt is a deterministic function of ξ1 , · · · , ξt−1 , so E[⟨∇f (wt ) − gt , wt − w∗ ⟩|ξ1 , · · · , ξt−1 ] = 0 We generalize the proof for smooth functions. √ 1 Theorem A.29. By setting ηt = L+β where βt = (σ/R) t, the following holds for the t convergence rate of stochastic mirror descent for smooth functions: E[f (wT )] − f (w∗ ) ≤ O( LR2 σR +√ ) T T Proof. 
f (wt ) − f (w∗ ) ≤ ⟨∇f (wt ), wt − w∗ ⟩ ≤ ⟨∇f (wt ), wt − wt+1 ⟩ + ⟨∇f (wt ), wt+1 − w∗ ⟩ (5) ≤ f (wt ) − f (wt+1 ) + L ∥wt+1 − wt ∥2 + ⟨∇f (wt ), wt+1 − w∗ ⟩ 2 304 By rearranging the terms we proceed as follows: f (wt+1 ) − f (w∗ ) ≤ ⟨∇f (wt ), wt+1 − w∗ ⟩ + L ∥wt+1 − wt ∥2 2 L ∥wt+1 − wt ∥2 + ⟨ζt , wt+1 − w∗ ⟩ 2 (1) 1 L ≤ ⟨w∗ − wt+1 , ∇Φ(wt+1 ) − ∇Φ(wt )⟩ + ∥wt+1 − wt ∥2 + ⟨ζt , wt+1 − w∗ ⟩ ηt 2 ≤ ⟨gt , wt+1 − w∗ ⟩ + (2) 1 ≤ ηt (4) 1 ≤ ηt [BΦ (w∗ , wt ) − BΦ (w∗ , wt+1 ) − BΦ (wt , wt+1 )] + L ∥wt+1 − wt ∥2 + ⟨ζt , wt+1 − w∗ ⟩ 2 [BΦ (w∗ , wt ) − BΦ (w∗ , wt+1 )] − (L + βt )BΦ (wt , wt+1 ) + LBΦ (wt , wt+1 ) + ⟨ζt , wt+1 − w∗ ⟩ = 1 [B (w∗ , wt ) − BΦ (w∗ , wt+1 )] − βt BΦ (wt , wt+1 ) + ⟨ζt , wt+1 − wt ⟩ + ⟨ζt , wt − w∗ ⟩ ηt Φ (3) 1 ≤ ηt [BΦ (w∗ , wt ) − BΦ (w∗ , wt+1 )] + 1 ∥ζt ∥2∗ + ⟨ζt , wt − w∗ ⟩ 2βt Summing for all t yields T ∑ f (wt+1 ) − f (w∗ ) t=1 ∑ ∑ ∥ζt ∥2 ∑ 1 1 1 ∗ ≤ BΦ (w∗ , w1 ) + BΦ (w∗ − wt )( − )+ + ⟨ζt , wt − w∗ ⟩ η1 ηt ηt−1 2βt T T T t=2 t=1 t=1 T T ∑ R2 1 1 σ2 ∑ 1 2 ( − ≤ +R )+ +0 η1 ηt ηt−1 2 βt t=2 ≤ R2 ηT + t=1 √ σR √ T = O(LR2 + σR T ), 2 where in the second in equality we used the fact E[⟨∇f (wt )−gt , wt −w∗ ⟩|ξ1 , · · · , ξt−1 ] = 0. By averaging the solutions over all iterations and applying the Jensen’s inequality we obtain the convergence rate claimed in the theorem. 305 306 BIBLIOGRAPHY 307 BIBLIOGRAPHY [1] Jacob Abernethy, Peter L Bartlett, Alexander Rakhlin, and Ambuj Tewari. Optimal strategies and minimax lower bounds for online convex games. In Proceedings of the Nineteenth Annual Conference on Computational Learning Theory, 2008. [2] Jacob Abernethy, Elad Hazan, and Alexander Rakhlin. Competing in the dark: An efficient algorithm for bandit linear optimization. In COLT, pages 263–274, 2008. [3] Jacob Abernethy, Elad Hazan, and Alexander Rakhlin. Interior-point methods for fullinformation and bandit online learning. IEEE Transactions on Information Theory, 58(7):4164–4175, 2012. [4] Alekh Agarwal, Peter L. Bartlett, Pradeep D. Ravikumar, and Martin J. Wainwright. Information-theoretic lower bounds on the oracle complexity of stochastic convex optimization. IEEE Transactions on Information Theory, 58(5):3235–3249, 2012. [5] Alekh Agarwal, Ofer Dekel, and Lin Xiao. Optimal algorithms for online convex optimization with multi-point bandit feedback. In Proceedings of The 23rd Conference on Learning Theory (COLT), pages 28–40, 2010. [6] Amit Agarwal, Elad Hazan, Satyen Kale, and Robert E. Schapire. Algorithms for portfolio management based on the newton method. In Proceedings of the 23rd international conference on Machine learning, pages 9–16, 2006. [7] M. Anthony and P.L. Bartlett. Neural network learning: Theoretical foundations. Cambridge University Press, 1999. [8] Sanjeev Arora, L´aszl´o Babai, Jacques Stern, and Z Sweedyk. The hardness of approximate optima in lattices, codes, and systems of linear equations. In Foundations of Computer Science, 1993. Proceedings., 34th Annual Symposium on, pages 724–733. IEEE, 1993. [9] Baruch Awerbuch and Robert D. Kleinberg. Adaptive routing with end-to-end feedback: distributed learning and geometric approaches. In Proceedings of the thirty-sixth annual ACM symposium on Theory of computing, STOC ’04, pages 45–53, New York, NY, USA, 2004. ACM. 308 [10] Kazuoki Azuma et al. Weighted sums of certain dependent random variables. Tohoku Mathematical Journal, 19(3):357–367, 1967. [11] K. Bache and M. Lichman. UCI machine learning repository, 2013. [12] Maria-Florina Balcan, Steve Hanneke, and Jennifer Wortman Vaughan. 
The true sample complexity of active learning. Machine Learning, 80(2-3):111–139, 2010. [13] Peter L Bartlett, O. Bousquet, and S. Mendelson. Local rademacher complexities. The Annals of Statistics, 33(4):1497–1537, 2005. [14] Peter L Bartlett, Michael I Jordan, and Jon D McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138–156, 2006. [15] Peter L Bartlett and Shahar Mendelson. Rademacher and gaussian complexities: Risk bounds and structural results. The Journal of Machine Learning Research, 3:463–482, 2003. [16] Jonathan Barzilai and Jonathan M Borwein. Two-point step size gradient methods. IMA Journal of Numerical Analysis, 8(1):141–148, 1988. [17] Amir Beck and Marc Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization. Oper. Res. Lett., 31(3):167–175, 2003. [18] Robert Bell and Thomas M Cover. Game-theoretic optimal portfolios. Management Science, 34(6):724–733, 1988. [19] Shai Ben-David, David Loker, Nathan Srebro, and Karthik Sridharan. Minimizing the misclassification error rate using a surrogate convex loss. In ICML, 2012. [20] Shai Ben-David, D´avid P´al, and Shai Shalev-Shwartz. Agnostic online learning. In COLT, 2009. [21] Dimitri P. Bertsekas. Nonlinear Programming. Athena Scientific, 2nd edition, 1999. [22] Dimitri P. Bertsekas, Angelia Nedic, and Asuman E. Ozdaglar. Convex Analysis and Optimization. Athena Scientific, 2003. [23] David Blackwell. An analog of the minimax theorem for vector payoffs. Pacific Journal of Mathematics, 1(6), 1956. 309 [24] David Blackwell. Minimax vs. bayes prediction. Probability in the Engineering and Informational Sciences, 9(01):53–58, 1995. [25] HD Block. The perceptron: A model for brain functioning. i. Reviews of Modern Physics, 34(1):123, 1962. [26] Anselm Blumer, A. Ehrenfeucht, David Haussler, and Manfred K. Warmuth. Learnability and the vapnik-chervonenkis dimension. J. ACM, 36(4):929–965, 1989. [27] Allan Borodin and Ran El-Yaniv. Online computation and competitive analysis. Cambridge University Press, 1998. [28] Jonathan M Borwein and Adrian S Lewis. Convex analysis and nonlinear optimization: theory and examples, volume 3. Springer, 2010. [29] Leon Bottou and Olivier Bousquet. The tradeoffs of large scale learning. In NIPS, pages 161–168, 2008. [30] St´ephane Boucheron, G´abor Lugosi, and Olivier Bousquet. Concentration inequalities. In Advanced Lectures on Machine Learning, pages 208–240. Springer, 2004. [31] St´ephane Boucheron, G´abor Lugosi, and Pascal Massart. Concentration inequalities: A nonasymptotic theory of independence. Oxford University Press, 2013. [32] Olivier Bousquet and Andr´e Elisseeff. Stability and generalization. The Journal of Machine Learning Research, 2:499–526, 2002. [33] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004. [34] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004. [35] Augustin Cauchy. M´ethode g´en´erale pour la r´esolution des systemes d´equations simultan´ees. Comp. Rend. Sci. Paris, 25(1847):536–538, 1847. [36] Nicol`o Cesa-Bianchi, Alex Conconi, and Claudio Gentile. On the generalization ability of on-line learning algorithms. IEEE Transactions on Information Theory, 50(9):2050– 2057, 2004. 310 [37] Nicolo Cesa-Bianchi, Yoav Freund, David Haussler, David P Helmbold, Robert E Schapire, and Manfred K Warmuth. How to use expert advice. Journal of the ACM (JACM), 44(3):427–485, 1997. 
[38] Nicolo Cesa-Bianchi and G´abor Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006. [39] Kamalika Chaudhuri, Yoav Freund, and Daniel J Hsu. A parameter-free hedging algorithm. In Advances in neural information processing systems, pages 297–305, 2009. [40] Chao-Kai Chiang, Chia-Jung Lee, and Chi-Jen Lu. Beating bandits in gradually evolving worlds. In COLT, pages 210–227, 2013. [41] Chao-Kai Chiang, Tianbao Yang, Chia-Jung Lee, Mehrdad Mahdavi, Chi-Jen Lu, Rong Jin, and Shenghuo Zhu. Online optimization with gradual variations. In Conference on Learning Theory, 2012. [42] Kenneth L. Clarkson. Coresets, sparse greedy approximation, and the frank-wolfe algorithm. ACM Transactions on Algorithms, 6(4), 2010. [43] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine learning, 20(3):273–297, 1995. [44] Andrew Cotter, Ohad Shamir, Nati Srebro, and Karthik Sridharan. Better mini-batch algorithms via accelerated gradient methods. In NIPS, pages 1647–1655, 2011. [45] Thomas M. Cover. Behavior of sequential predictors of binary sequences. Transactions of the Prague Conferences on Information Theory, 1965. [46] Varsha Dani, Thomas P. Hayes, and Sham Kakade. The price of bandit information for online optimization. In NIPS, 2007. [47] Ofer Dekel, Ran Gilad-Bachrach, Ohad Shamir, and Lin Xiao. Optimal distributed online prediction using mini-batches. The Journal of Machine Learning Research, 13:165–202, 2012. [48] Ofer Dekel and Yoram Singer. Data-driven online to batch conversions. In NIPS, 2005. [49] Devdatt P Dubhashi and Alessandro Panconesi. Concentration of measure for the analysis of randomized algorithms. Cambridge University Press, 2009. 311 [50] John C. Duchi, Peter L. Bartlett, and Martin J. Wainwright. Randomized smoothing for stochastic optimization. SIAM Journal on Optimization, 22(2):674–701, 2012. [51] John C. Duchi, Shai Shalev-Shwartz, Yoram Singer, and Tushar Chandra. Efficient projections onto the l1-ball for learning in high dimensions. In ICML, pages 272–279, 2008. [52] A. Ehrenfeucht, D. Haussler, M. Kearns, and L. Valiant. A general lower bound on the number of examples needed for learning. Information and Computation, 82(3):247–261, 1989. [53] Tim V Erven, Wouter M Koolen, Steven D Rooij, and Peter Gr¨ unwald. Adaptive hedge. In Advances in Neural Information Processing Systems, pages 1656–1664, 2011. [54] Abraham D. Flaxman, Adam Tauman Kalai, and H. Brendan McMahan. Online convex optimization in the bandit setting: gradient descent without a gradient. In SODA, pages 385–394, 2005. [55] Sally Floyd and Manfred Warmuth. Sample compression, learnability, and the vapnikchervonenkis dimension. Machine Learning, 21(3):269–304, 1995. [56] M. Frank and P. Wolfe. An algorithm for quadratic programming. Naval Research Logistics, 3, 1956. [57] David A Freedman. On tail probabilities for martingales. the Annals of Probability, pages 100–118, 1975. [58] Yoav Freund and Robert E Schapire. A desicion-theoretic generalization of on-line learning and an application to boosting. In Computational learning theory, pages 23– 37. Springer, 1995. [59] Yoav Freund and Robert E Schapire. Adaptive game playing using multiplicative weights. Games and Economic Behavior, 29(1):79–103, 1999. [60] Michael P Friedlander and Mark Schmidt. Hybrid deterministic-stochastic methods for data fitting. SIAM Journal on Scientific Computing, 34(3):A1380–A1405, 2012. [61] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. 
Additive logistic regression: a statistical view of boosting (with discussion and a rejoinder by the authors). The annals of statistics, 28(2):337–407, 2000. 312 [62] Saeed Ghadimi and Guanghui Lan. Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization i: a generic algorithmic framework. SIAM Journal on Optimization, 22(4):1469–1492, 2012. [63] Geoffrey J Gordon. Regret bounds for prediction problems. In Proceedings of the twelfth annual conference on Computational learning theory, pages 29–40. ACM, 1999. [64] John Greenstadt. On the relative efficiencies of gradient methods. Mathematics of Computation, pages 360–367, 1967. [65] James Hannan. Approximation to bayes risk in repeated play. Contributions to the Theory of Games, 3:97–139, 1957. [66] Steve Hanneke. Theoretical Foundations of Active Learning. PhD thesis, 2009. [67] Elad Hazan. Sparse approximate solutions to semidefinite programs. In LATIN, pages 306–316, 2008. [68] Elad Hazan, Amit Agarwal, and Satyen Kale. Logarithmic regret algorithms for online convex optimization. Machine Learning, 69(2-3):169–192, 2007. [69] Elad Hazan and Satyen Kale. Extracting certainty from uncertainty: Regret bounded by variation in costs. In COLT, pages 57–68, 2008. [70] Elad Hazan and Satyen Kale. Better algorithms for benign bandits. In SODA, pages 38–47, 2009. [71] Elad Hazan and Satyen Kale. Extracting certainty from uncertainty: Regret bounded by variation in costs. Machine learning, 80(2-3):165–188, 2010. [72] Elad Hazan and Satyen Kale. Beyond the regret minimization barrier: an optimal algorithm for stochastic strongly-convex optimization. COLT, 19:421–436, 2011. [73] Elad Hazan and Satyen Kale. Projection-free online learning. In ICML, 2012. [74] Elad Hazan and Tomer Koren. Optimal algorithms for ridge and lasso regression with partially observed attributes. CoRR, 2011. [75] Klaus-Uwe Hoffgen, Hans-Ulrich Simon, and Kevin S Vanhorn. Robust trainability of single neurons. Journal of Computer and System Sciences, 50(1):114–125, 1995. 313 [76] Anatoli Iouditski and Yuri Nesterov. Primal-dual subgradient methods for minimizing uniformly convex functions. available at http://hal.archivesouvertes.fr/docs/00/50/89/33/PDF/Strong-hal.pdf, 2010. [77] Martin Jaggi. Sparse Convex Optimization Methods for Machine Learning. PhD thesis, ETH Zurich, October 2011. [78] Martin Jaggi and Marek Sulovsk´ y. A simple algorithm for nuclear norm regularized problems. In ICML, pages 471–478, 2010. [79] Kevin G Jamieson, Robert D Nowak, and Benjamin Recht. Query complexity of derivative-free optimization. In NIPS, 2012. [80] Wenxin Jiang. Process consistency for adaboost. The Annals of Statistics, 32(1):13–29, 2004. [81] Anatoli Juditsky and Yuri Nesterov. Primal-dual subgradient methods for minimizing uniformly convex functions. Technical report, 2010. [82] Sham M. Kakade, Karthik Sridharan, and Ambuj Tewari. On the complexity of linear prediction: Risk bounds, margin bounds, and regularization. In NIPS, pages 793–800, 2008. [83] Sham M. Kakade and Ambuj Tewari. On the generalization ability of online strongly convex programming algorithms. In Advances in Neural Information Processing Systems 21, pages 801–808, 2008. [84] Adam Tauman Kalai, Adam R Klivans, Yishay Mansour, and Rocco A Servedio. Agnostically learning halfspaces. SIAM Journal on Computing, 37(6):1777–1805, 2008. [85] Adam Tauman Kalai and Ravi Sastry. The isotron algorithm: High-dimensional isotonic regression. In COLT, 2009. 
[86] Adam Tauman Kalai and Santosh Vempala. Efficient algorithms for online decision problems. J. Comput. Syst. Sci., 71:291–307, October 2005. [87] Michael J Kearns, Robert E Schapire, and Linda M Sellie. Toward efficient agnostic learning. Machine Learning, 17(2-3):115–141, 1994. [88] J. Kivinen, A. J. Smola, and R. C. Williamson. Online Learning with Kernels. IEEE Transactions on Signal Processing, 52:2165–2176, 2004. 314 [89] Jyrki Kivinen and Manfred K. Warmuth. Additive versus exponentiated gradient updates for linear prediction. In Proceedings of the 27th annual ACM symposium on Theory of computing, Proceedings of the Twenty-Seventh Annual ACM Symposium on Theory of Computing, pages 209–218, New York, NY, USA, 1995. ACM. [90] Jyrki Kivinen and Manfred K Warmuth. Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 132(1):1–63, 1997. [91] Vladimir Koltchinskii. Rademacher penalties and structural risk minimization. Information Theory, IEEE Transactions on, 47(5):1902–1914, 2001. [92] Vladimir Koltchinskii. Oracle Inequalities in Empirical Risk Minimization and Sparse Recovery Problems. Lecture Notes in mathematics. Springer, 2011. [93] Guanghui Lan. Efficient methods for stochastic composite optimization. Technical report, 2008. [94] Guanghui Lan. An optimal method for stochastic composite optimization. Mathematical Programming, 133(1-2):365–397, 2012. [95] Wee Sun Lee, Peter L. Bartlett, and Robert C. Williamson. The importance of convexity in learning with squared loss. IEEE Transactions on Information Theory, 44(5):1974–1980, 1998. [96] Qihang Lin, Xi Chen, and Javier Pena. A smoothing stochastic gradient method for composite optimization. arXiv preprint arXiv:1008.5204, 2010. [97] Yi Lin. A note on margin-based loss functions in classification. Statistics & probability letters, 68(1):73–82, 2004. [98] Nick Littlestone. Learning quickly when irrelevant attributes abound: A new linearthreshold algorithm. Machine learning, 2(4):285–318, 1988. [99] Nick Littlestone and Manfred K Warmuth. The weighted majority algorithm. Information and computation, 108(2):212–261, 1994. [100] Jun Liu and Jieping Ye. Efficient euclidean projections in linear time. In ICML, pages 83–90, 2009. [101] G´abor Lugosi and Nicolas Vayatis. On the bayes-risk consistency of regularized boosting methods. Annals of Statistics, pages 30–55, 2004. 315 [102] Mehrdad Mahdavi and Rong Jin. Passive learning with target risk. COLT, 2013. [103] Mehrdad Mahdavi and Rong Jin. Excess risk bounds for exponentially concave losses. arXiv preprint arXiv:1401.4566, 2014. [104] Mehrdad Mahdavi, Rong Jin, and Tianbao Yang. Trading regret for efficiency: Online convex optimization with long term constraints. Journal of Machine Learning Research, 13:2503–2528, 2012. [105] Mehrdad Mahdavi, Tianbao Yang, and Rong Jin. Stochastic convex optimization with multiple objectives. In NIPS, 2013. [106] Mehrdad Mahdavi, Tianbao Yang, Rong Jin, Shenghuo Zhu, and Jinfeng Yi. Stochastic gradient descent with only one projection. In NIPS, pages 503–511, 2012. [107] Mehrdad Mahdavi, Lijun Zhang, and Rong Jin. Mixed optimization for smooth functions. In NIPS, 2013. [108] Mehrdad Mahdavi, Lijun Zhang, and Rong Jin. Binary excess risk for smooth convex surrogates. arXiv preprint arXiv:1402.1792, 2014. [109] Enno Mammen and Alexandre B Tsybakov. Smooth discrimination analysis. The Annals of Statistics, 27(6):1808–1829, 1999. [110] Shie Mannor and John N. Tsitsiklis. Online learning with constraints. 
In COLT, pages 529–543, 2006. [111] Shie Mannor, John N. Tsitsiklis, and Jia Yuan Yu. Online learning with sample path constraints. Journal of Machine Learning Research, 10:569–590, 2009. [112] David A McAllester. Some pac-bayesian theorems. In Proceedings of the eleventh annual Conference on Computational learning theory, pages 230–234. ACM, 1998. [113] H. Brendan McMahan and Matthew J. Streeter. Open problem: Better bounds for online logistic regression. Journal of Machine Learning Research - Proceedings Track, 23:44.1–44.3, 2012. [114] Hariharan Narayanan and Alexander Rakhlin. Random walk approach to regret minimization. In NIPS, pages 1777–1785, 2010. 316 [115] Arkadai Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009. [116] Arkadi Nemirovski. Efficient methods in convex programming. Lecture Notes, Available at http://www2.isye.gatech.edu/ nemirovs, 1994. [117] Arkadi Nemirovski. Prox-method with rate of convergence o(1/t) for variational inequalities with lipschitz continuous monotone operators and smooth convex-concave saddle point problems. SIAM J. on Optimization, 15:229–251, 2005. [118] Arkadi S Nemirovski and Michael J Todd. Interior-point methods for optimization. Acta Numerica, 17:191–234, 2008. [119] Arkadi Nemirovsky and David Borisovich Yudin. Problem complexity and method efficiency in optimization. Wiley- Interscience Series in Discrete Mathematics. John Wiley, 1983. [120] Yurii Nesterov. A method of solving a convex programming problem with convergence rate o (1/k2). In Soviet Mathematics Doklady, volume 27, pages 372–376, 1983. [121] Yurii Nesterov. Introductory lectures on convex optimization: a basic course, volume 87 of Applied optimization. Kluwer Academic Publishers, 2004. [122] Yurii Nesterov. Excessive gap technique in nonsmooth convex minimization. SIAM Journal on Optimization, 16(1):235–249, 2005. [123] Yurii Nesterov. Smooth minimization of non-smooth functions. Mathematical Programming, 103(1):127–152, 2005. [124] Yurii Nesterov. Gradient methods for minimizing composite objective function. Mathematical Programming, 140(1), 2013. [125] Alexander Rakhlin, Ohad Shamir, and Karthik Sridharan. Making gradient descent optimal for strongly convex stochastic optimization. In Proceedings of the 29th International Conference on Machine Learning (ICML), pages 449–456, 2012. [126] Alexander Rakhlin, Karthik Sridharan, and Ambuj Tewari. Online learning: Random averages, combinatorial parameters, and learnability. In NIPS, pages 1984–1992, 2010. 317 [127] Aaditya Ramdas and Aarti Singh. Optimal stochastic convex optimization through the lens of active learning. In ICML, 2013. [128] Nicolas Le Roux, Mark Schmidt, and Francis Bach. A stochastic gradient method with an exponential convergence rate for finite training sets. In Advances in Neural Information Processing Systems 25 (NIPS), pages 2672–2680, 2012. [129] Shai Shalev-Shwartz. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107–194, 2012. [130] Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014. [131] Shai Shalev-Shwartz, Ohad Shamir, Nathan Srebro, and Karthik Sridharan. Learnability and stability in the general learning setting. COLT, 2009. [132] Shai Shalev-Shwartz, Ohad Shamir, Nathan Srebro, and Karthik Sridharan. Stochastic convex optimization. In COLT, 2009. 
[133] Shai Shalev-Shwartz, Ohad Shamir, Nathan Srebro, and Karthik Sridharan. Learnability, stability and uniform convergence. Journal of Machine Learning Research, 11:2635–2670, 2010. [134] Shai Shalev-Shwartz, Yoram Singer, and Nathan Srebro. Pegasos: Primal estimated sub-gradient solver for svm. In ICML, pages 807–814, 2007. [135] Shai Shalev-Shwartz and Tong Zhang. Stochastic dual coordinate ascent methods for regularized loss minimization. JMLR, 14:567599, 2013. [136] Ohad Shamir. On the complexity of bandit and derivative-free stochastic convex optimization. In Conference on Learning Theory, 2013. [137] Ohad Shamir and Tong Zhang. Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes. ICML, 2013. [138] Nathan Srebro, Karthik Sridharan, and Ambuj Tewari. Smoothness, low noise and fast rates. In NIPS, pages 2199–2207, 2010. [139] Karthik Sridharan. Learning from an optimization viewpoint. PhD Thesis, 2012. 318 [140] Karthik Sridharan, Shai Shalev-Shwartz, and Nathan Srebro. Fast rates for regularized objectives. In NIPS, pages 1545–1552, 2008. [141] Ingo Steinwart. Consistency of support vector machines and other regularized kernel classifiers. Information Theory, IEEE Transactions on, 51(1):128–142, 2005. [142] Eiji Takimoto and Manfred K. Warmuth. Path kernels and multiplicative updates. Journal Machine Learnning Research, 4:773–818, December 2003. [143] Paul Tseng. On accelerated proximal gradient methods for convex-concave optimization, 2009. [144] Leslie G Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134– 1142, 1984. [145] Vladimir N Vapnik. Statistical learning theory. 1998. [146] Vladimir N Vapnik and A Ya Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability & Its Applications, 16(2):264–280, 1971. [147] Volodimir G Vovk. Aggregating strategies. In Proc. Third Workshop on Computational Learning Theory, pages 371–383. Morgan Kaufmann, 1990. [148] David H Wolpert. The lack of a priori distinctions between learning algorithms. Neural computation, 8(7):1341–1390, 1996. [149] Qiang Wu and Ding-Xuan Zhou. Svm soft margin classifiers: Linear programming versus quadratic programming. Neural Computation, 17(5):1160–1187, 2005. [150] Tianbao Yang, Mehrdad Mahdavi, Rong Jin, and Shenghuo Zhu. Regret bounded by gradual variation for online convex optimization. Journal of Machine Learning Research-Proceedings Track, 23:6–1, 2012. [151] Yiming Ying and Peng Li. Distance metric learning with eigenvalue optimization. JMLR., 13:1–26, 2012. [152] Lijun Zhang, Mehrdad Mahdavi, and Rong Jin. Linear convergence with condition number independent access of full gradients. In NIPS, 2013. 319 [153] Lijun Zhang, Mehrdad Mahdavi, Rong Jin, Tianbao Yang, and Shenghuo Zhu. Recovering the optimal solution by dual random projection. In COLT, 2013. [154] Lijun Zhang, Tianbao Yang, Rong Jin, and Xiaofei He. O(logt) projections for stochastic optimization of smooth and strongly convex functions. ICML, 2013. [155] Tong Zhang. Sequential greedy approximation for certain convex optimization problems. Information Theory, IEEE Transactions on, 49:682–691, 2003. [156] Tong Zhang. Statistical behavior and consistency of classification methods based on convex risk minimization. Annals of Statistics, pages 56–85, 2004. [157] Ding-Xuan Zhou. Capacity of reproducing kernel spaces in learning theory. Information Theory, IEEE Transactions on, 49(7):1743–1752, 2003. 
[158] Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In ICML, pages 928–936, 2003. 320