EXPLOITING SMOOTHNESS IN STATISTICAL LEARNING, SEQUENTIAL PREDICTION, AND STOCHASTIC OPTIMIZATION

By

Mehrdad Mahdavi

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

Computer Science and Engineering - Doctor of Philosophy

2014

ABSTRACT

EXPLOITING SMOOTHNESS IN STATISTICAL LEARNING, SEQUENTIAL PREDICTION, AND STOCHASTIC OPTIMIZATION

By Mehrdad Mahdavi

In the last several years, the intimate connection between convex optimization and learning problems, in both the statistical and sequential frameworks, has shifted the focus of algorithmic machine learning to examining this interplay. In particular, on one hand, this intertwinement brings forward new challenges in reassessing the performance of learning algorithms under the assumptions imposed by convexity, such as Lipschitzness, strong convexity, and smoothness. On the other hand, the emergence of datasets of an unprecedented size demands the development of efficient optimization algorithms to tackle large-scale learning problems.

The overarching goal of this thesis is to reassess the smoothness of loss functions in statistical learning, sequential prediction/online learning, and stochastic optimization and to explicate its consequences. In particular, we examine how leveraging the smoothness of the loss function can be beneficial or detrimental in these settings in terms of sample complexity, statistical consistency, regret analysis, and convergence rate.

In the statistical learning framework, we investigate the sample complexity of learning problems when the loss function is smooth and strongly convex and the learner is provided with the target risk as prior knowledge. We establish that under these assumptions, by exploiting the smoothness of the loss function, we are able to improve the sample complexity of learning exponentially. We also investigate smoothness from the viewpoint of statistical consistency and show that, in sharp contrast to optimization and generalization, where smoothness is favorable because of its computational and theoretical virtues, the smoothness of the surrogate loss function might deteriorate the binary excess risk. Motivated by this negative result, we provide a unified analysis of three types of errors, including the optimization error, the generalization bound, and the error in translating the convex excess risk into a binary excess risk, and underline the conditions under which smoothness might be preferred.

We then turn to elaborating the importance of smoothness in sequential prediction/online learning. We introduce a new measure to assess the performance of online learning algorithms, referred to as the gradual variation. The gradual variation is measured by the sum of the distances between every two consecutive loss functions and is more suitable for gradually evolving environments such as stock prediction. Under the smoothness assumption, we devise novel algorithms for online convex optimization with regret bounded by the gradual variation.

Finally, we investigate how to exploit the smoothness of the loss function in convex optimization. We propose a novel optimization paradigm, referred to as mixed optimization, which interpolates between stochastic and full gradient methods and is able to exploit the smoothness of loss functions to obtain faster convergence rates in stochastic optimization, and condition number independent accesses of full gradients in deterministic optimization.
We also propose efficient projection-free optimization algorithms to tackle the computational challenge arising from the projection steps that are required at each iteration of most existing gradient based optimization methods to ensure the feasibility of intermediate solutions. In the stochastic optimization setting, by introducing and leveraging smoothness, we develop novel methods which only require one projection at the final iteration. In the online learning setting, we consider online convex optimization with soft constraints, where the constraints are only required to be satisfied in the long run.

To my parents, Asieh and Rashid.

ACKNOWLEDGMENTS

First and foremost, I feel indebted to my advisor, Professor Rong Jin, for his guidance, encouragement, and inspiring supervision throughout the course of this research work. His patience, extensive knowledge, and creative thinking have been the source of inspiration for me. He was available for advice or academic help whenever I needed it and gently guided me toward deeper understanding, no matter how late or inconvenient the time was. When I was struggling with the decision to quit my Ph.D. at Sharif University to join Rong's group, I was not sure about my choice, but after four and a half years, I am happy to say that I did not make a wrong decision. It's hard to express how thankful I am for his unwavering support over the last years.

I would like to take this opportunity to thank my thesis committee members Pang-Ning Tan, Ambuj Tewari, and Eric Torng, who accommodated my timing constraints despite their full schedules and provided me with precious feedback on the presentation of the results, in both written and oral form.

During my Ph.D. studies, I had the pleasure of collaborating with many researchers, from each and every one of whom I had things to learn, and the quality of my research was considerably enhanced by these interactions. I would like to thank Tianbao Yang and Lijun Zhang for all the discussions we had and the fun moments we spent doing research and attending conferences. The results of Chapter 5 and Chapter 8 represent part of the fruits of these collaborations. I also spent a summer as an intern at Microsoft Research working with Ofer Dekel and a summer at NEC Research Labs working with Shenghuo Zhu. I learned a lot from them and would like to express my gratitude for having me as an intern. I also would like to thank Elad Hazan, Satyen Kale, Phil Long, Shai Shalev-Shwartz, and Ohad Shamir for some helpful email correspondence.

Living in East Lansing without my good friends would not have been easy. I want to thank all my friends in the department and outside the department. I wish I could name you all.

Last but definitely not least, I want to express my deepest gratitude to my beloved parents and my dearest siblings. Their love and unwavering support have been crucial to my success, and a constant source of comfort and counsel. Special thanks to my parents for abiding by my absence in the last five years.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
LIST OF ALGORITHMS

Chapter 1 Introduction
  1.1 Classical Statistical Learning
    1.1.1 Smoothness and sample complexity
    1.1.2 Smoothness and binary excess risk
  1.2 Sequential Prediction/Game Theoretic Learning
    1.2.1 Smoothness and regret bounds
  1.3 Convex Optimization and Learning
    1.3.1 Smoothness and convergence rate
    1.3.2 Smoothness and projection-free optimization
  1.4 Main Contributions
    1.4.1 Statistical Learning
    1.4.2 Sequential Prediction/Online Learning
    1.4.3 Stochastic Optimization
  1.5 Thesis Overview
  1.6 Bibliographic Notes

Chapter 2 Preliminaries
  2.1 Statistical Learning
    2.1.1 Statistical Learning Model
    2.1.2 Empirical Risk Minimization
    2.1.3 Surrogate Loss Functions and Statistical Consistency
    2.1.4 Convex Learning Problems
  2.2 Sequential Prediction/Online Learning
    2.2.1 Mistake Bound Model and Regret Analysis
    2.2.2 Online Convex Optimization and Regret Bounds
      2.2.2.1 Online Gradient Descent
      2.2.2.2 Follow The Perturbed Leader
      2.2.2.3 Follow The Regularized Leader
      2.2.2.4 Online Mirror Descent
      2.2.2.5 Online Newton Step
    2.2.3 Variational Regret Bounds
    2.2.4 Bandit Online Convex Optimization
    2.2.5 From Regret to Risk Bounds
  2.3 Convex Optimization
    2.3.1 Oracle Complexity of Optimization
    2.3.2 Deterministic Convex Optimization
      2.3.2.1 Gradient Descent Method
      2.3.2.2 Accelerated Gradient Descent Method
      2.3.2.3 Mirror Descent Method
      2.3.2.4 Mirror Prox Method
      2.3.2.5 Conditional Gradient Descent Method
    2.3.3 Stochastic Convex Optimization
    2.3.4 Convex Optimization for Learning Problems
    2.3.5 From Stochastic Optimization to Convex Learning Theory

Chapter 3 Passive Learning with Target Risk
  3.1 Setup and Motivation
  3.2 The Curse of Stochastic Oracle
  3.3 The ClippedSGD Algorithm
    3.3.1 The Algorithm Description
    3.3.2 Main Result on Sample Complexity
  3.4 Analysis of Sample Complexity
  3.5 Proofs of Sample Complexity
    3.5.1 Proof of Lemma 3.6
    3.5.2 Proof of Lemma 3.7
  3.6 Summary
  3.7 Bibliographic Notes

Chapter 4 Statistical Consistency of Smoothed Hinge Loss
  4.1 Motivation
  4.2 Classification Calibration and Surrogate Risk Bounds
  4.3 Binary Excess Risk for Smoothed Hinge Loss
    4.3.1 ψ-Transform for Smoothed Hinge Loss
    4.3.2 Bounding E(h) based on Eϕ(h)
  4.4 A Unified Analysis
    4.4.1 Bounding Smooth Excess Convex Risk Eϕ(h)
    4.4.2 Bounding Binary Excess Risk E(h)
  4.5 Proofs of Statistical Consistency
    4.5.1 Proof of Theorem 4.5
    4.5.2 Proof of Theorem 4.6
    4.5.3 Proof of Theorem 4.11
  4.6 Summary

Chapter 5 Regret Bounded by Gradual Variation
  5.1 Variational Regret Bounds
  5.2 Gradual Variation and Necessity of Smoothness
  5.3 The Improved FTRL Algorithm
  5.4 The Online Mirror Prox Algorithm
    5.4.1 Online Mirror Prox Method with General Norms
    5.4.2 Online Linear Optimization
    5.4.3 Prediction with Expert Advice
    5.4.4 Online Strictly Convex Optimization
    5.4.5 Gradual Variation Bounds which Hold Uniformly over Time
  5.5 Bandit Online Mirror Prox with Gradual Variation Bounds
  5.6 Proofs of Gradual Variation
    5.6.1 Proof of Theorem 5.3
    5.6.2 Proof of Theorem 5.5
    5.6.3 Proof of Corollary 5.12
    5.6.4 Proof of Theorem 5.13
  5.7 Summary
  5.8 Bibliographic Notes

Chapter 6 Gradual Variation for Composite Losses
  6.1 Composite Losses with a Fixed Non-smooth Component
    6.1.1 A Simplified Online Mirror Prox Algorithm
    6.1.2 A Gradual Variation Bound for Online Non-Smooth Optimization
  6.2 Composite Losses with an Explicit Max Structure
  6.3 Summary

Chapter 7 Mixed Optimization for Smooth Losses
  7.1 Motivation
  7.2 The MixedGrad Algorithm
  7.3 Analysis of Convergence Rate
  7.4 Proofs of Convergence Rate
    7.4.1 Proof of Lemma 7.7
  7.5 Summary
  7.6 Bibliographic Notes

Chapter 8 Mixed Optimization for Smooth and Strongly Convex Losses
  8.1 Introduction
  8.2 The Epoch Mixed Gradient Descent Algorithm
  8.3 Analysis of Convergence Rate
  8.4 Experiments
  8.5 Discussion
  8.6 Summary
  8.7 Bibliographic Notes

Chapter 9 Efficient Optimization with Bounded Projections
  9.1 Setup and Motivation
  9.2 Stochastic Frank-Wolfe Algorithm
  9.3 Stochastic Optimization with Single Projection
    9.3.1 General Convex Functions
    9.3.2 Strongly Convex Functions with Smoothing
  9.4 Analysis of Convergence Rate
    9.4.1 Convergence Rate for General Convex Functions
    9.4.2 Convergence Rate for Strongly Convex Functions
  9.5 Online Optimization with Soft Constraints
    9.5.1 An Impossibility Theorem
    9.5.2 An Online Algorithm with Vanishing Violation of Constraints
    9.5.3 An Online Algorithm without Violation of Constraints
  9.6 Proofs of Convergence Rates
    9.6.1 Proof of Lemma 9.10
    9.6.2 Proof of Lemma 9.11
    9.6.3 Proof of Lemma 9.13
    9.6.4 Proof of Lemma 9.14
  9.7 Summary
  9.8 Bibliographic Notes

APPENDIX
BIBLIOGRAPHY

LIST OF TABLES

Table 2.1  Lower bound on the oracle complexity for stochastic/deterministic first-order optimization methods. Here ρ, α, and β are the Lipschitzness, strong convexity, and smoothness parameters, respectively. The parameter κ is the condition number of the function and is defined as κ = β/α.
Table 8.1  The optimal iteration complexity of convex optimization. L and λ are the moduli of smoothness and strong convexity, respectively. κ = L/λ is the condition number.
Table 8.2  Experimental results on the Adult data set.
Table 8.3  Experimental results on the RCV1 data set.
Table 8.4  The testing accuracy on the RCV1 and Adult data sets.
Table 8.5  The computational complexity for minimizing (1/n) ∑_{i=1}^n f_i(w).

LIST OF FIGURES

Figure 2.1  Illustrations of the 0-1 loss function and three surrogate convex loss functions: hinge loss, logistic loss, and exponential loss, as scalar functions of y⟨w, x⟩.
Figure 2.2  Reduction of the general online convex optimization problem to online optimization with linear functions.
Figure 2.3  Reduction of bandit online convex optimization to online convex optimization with full information. The full OCO algorithm needs to play from a shrunk domain (1 − ξ)W to ensure that the sampled points belong to the domain.
Figure 5.1  Illustration of the main idea behind the proposed improved FTRL and online mirror prox methods to attain regret bounds in terms of gradual variation for linear loss functions. The learner plays the decision ŵ_t instead of w_t to suffer less regret when the consecutive loss functions are gradually evolving.

LIST OF ALGORITHMS

Algorithm 1   ClippedSGD Algorithm
Algorithm 2   Linearized Follow The Regularized Leader for OCO
Algorithm 3   Improved FTRL (IFTRL) Algorithm
Algorithm 4   Online Mirror Prox (OMP) Algorithm
Algorithm 5   Online Mirror Prox Method for General Norms
Algorithm 6   Deterministic Online Bandit Convex Optimization
Algorithm 7   A Simplified General Online Mirror Prox Method
Algorithm 8   Online Mirror Prox Method with a Fixed Non-Smooth Component
Algorithm 9   Online Mirror Prox Method with an Explicit Max Structure
Algorithm 10  MixedGrad Algorithm
Algorithm 11  Epoch Mixed Gradient Descent (EMGD) Algorithm
Algorithm 12  SGD with ONE Projection by Primal Dual Updating (SGD-PD)
Algorithm 13  SGD with ONE Projection by a Smoothing Technique (SGD-ST)
Algorithm 14  Online Gradient Descent with Soft Constraints
Chapter 1

Introduction

In machine learning the goal is to learn from labeled examples in order to predict the labels of unseen examples. That is, given a training set, we aim to learn a hypothesis, or classifier, that assigns labels to samples that have never been observed by the algorithm. Efficiently finding a hypothesis based on the training set which minimizes some measure of performance is the main focus of machine learning. In order to study the learning problem in a mathematical framework, it is necessary to define the framework in which the algorithm is to function. Basically, there are two frameworks that have gained significant popularity within the last two decades: the statistical learning framework and the sequential prediction or online learning framework. In both settings, mathematical optimization theory plays an important role by providing a unified framework to investigate the computational issues of learning algorithms. Additionally, tools from convex optimization underlie the analysis in algorithmic machine learning that targets most practical learning algorithms in both frameworks.

This chapter is devoted to an overview of these three broad topics of statistical learning, sequential prediction/online learning, and convex optimization, aiming to develop a general correspondence between the first two and convex optimization. In particular, by characterizing sample complexity, statistical consistency, regret analysis, and convergence rate in terms of the properties of loss functions such as Lipschitzness, strong convexity, and smoothness, we elaborate on the importance of smoothness and explicate its consequences. (The precise definition will be given later; we say that a continuously differentiable function f : R^d → R is β-smooth if its gradient is Lipschitz with constant β, i.e., ∥∇f(w) − ∇f(w′)∥ ≤ β∥w − w′∥.) Here we move towards the definitions in a fairly non-technical manner, and the formal definitions will be given in Chapter 2.

1.1 Classical Statistical Learning

We begin by stating the basic problem of binary classification in the standard passive supervised learning setting (also called batch learning). In binary classification, the learning algorithm is given a set of labeled examples S = ((x1, y1), · · · , (xn, yn)) drawn independent and identically distributed (i.i.d.) from a fixed but unknown distribution D over the space Ξ = X × Y, where X is the instance space and Y is the label (target) space. The goal, with the help of the provided labeled examples, is to output a hypothesis or classifier h from a predefined hypothesis class H = {h : X → Y} that does well on unseen examples coming from the same distribution. In other words, we would like to find a hypothesis that generalizes well from the training set to the entire domain of examples.

To measure the performance of a classifier h ∈ H on unseen samples, we utilize a loss function ℓ : H × Ξ → R+. The most commonly used loss function in the binary classification problem, with instance space defined as Ξ = R^d × {−1, +1}, is the 0-1 loss ℓ(h, (x, y)) = I[h(x) ≠ y], where I[·] is the indicator function. The risk of a particular classifier h ∈ H is the probability that the classifier does not predict the correct label on a random data point generated by the underlying distribution D, i.e., L_D(h) = P_{(x,y)∼D}[h(x) ≠ y] = E_{(x,y)∼D}[ℓ(h, (x, y))]. This performance measure is also called the generalization error or true risk in the statistical learning community.
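Since the true risk L_D(h) is an expectation over the unknown distribution D, in practice it is approximated by an empirical average over a finite sample. The following small Python sketch is an illustration only (the synthetic data, the fixed linear classifier, and all function names are assumptions of this example, not part of the thesis): it shows the empirical 0-1 risk of a fixed hypothesis approaching its true risk as the sample size grows.

    import numpy as np

    rng = np.random.default_rng(0)

    def sample(n, d=5):
        # Draw n i.i.d. examples (x, y) from a synthetic distribution D:
        # x is standard Gaussian and y follows a noisy linear labeling rule.
        x = rng.normal(size=(n, d))
        y = np.sign(x @ np.ones(d) + 0.5 * rng.normal(size=n))
        return x, y

    def empirical_zero_one_risk(w, x, y):
        # Average of the 0-1 loss I[h(x) != y] for h(x) = sign(<w, x>).
        return float(np.mean(np.sign(x @ w) != y))

    w = np.ones(5)  # a fixed hypothesis h(x) = sign(<w, x>)
    for n in (100, 10_000, 1_000_000):
        x, y = sample(n)
        print(n, empirical_zero_one_risk(w, x, y))  # approaches L_D(h) as n grows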
An equivalent way to express the generalization bound is the sample complexity analysis. Roughly speaking, the sample complexity of an algorithm is the number of examples which is sufficient to ensure that, with probability at least 1 − δ (w.r.t. the random choice of S), the algorithm picks a hypothesis with an error that is at most ϵ away from that of the optimal one. The difference between the risk of a particular classifier h and that of the optimal classifier h* = arg min_{h∈H} L_D(h) is called the excess risk of h, i.e., E(h) = L_D(h) − L_D(h*).

Since the underlying distribution D is unknown to the learner, it is impossible for the learner to directly minimize the generalization error or true risk. Therefore, one has to resort to using the training data in S to estimate the probabilities of error for the classifiers in H. This alternative approach is known as the Empirical Risk Minimization (ERM) method and aims to pick a hypothesis which has small error on the training set, i.e., small empirical risk. The performance of empirical risk minimization has been thoroughly investigated and is well understood using tools from empirical process theory. It is a well established fact that a problem is learnable with the ERM method if and only if the empirical error for all hypotheses in H converges uniformly to the true risk. Furthermore, the uniform convergence holds if the complexity of the hypothesis class H satisfies certain combinatorial characteristics. It is one of the main achievements of statistical learning theory to characterize, and establish necessary and sufficient conditions for, the learnability of learning problems using the ERM rule.

While the ERM method is theoretically appealing, from a practical point of view one would like to consider problems that are efficiently learnable, which refers to the computational complexity of the learning algorithm. This issue becomes more important by noting the fact that in many cases the ERM approach suffers from substantial problems, such as the computational requirements of minimizing the 0-1 loss over the training set. Indeed, solving the ERM problem for the 0-1 loss function is known to be an NP-hard problem. Consequently, it is natural to consider loss functions that act as surrogates for the non-convex 0-1 loss and lead to practical algorithms. Of course, such a surrogate loss must be reasonably related to the original binary loss function, since otherwise this approach fails. For the classification problem, good surrogate loss functions have been identified, and the relationship between the excess classification risk and the excess risk of these surrogate loss functions has been exactly described.

An important family of learning problems that can be learnt efficiently are called Convex Learning Problems. In general, a convex learning problem is a setting in which the surrogate loss function and the hypothesis space H are both convex. This setting encompasses an enormous variety of well-known practical learning algorithms such as regression, support vector machines (SVMs), boosting, and logistic regression, where these algorithms differ in the type of the convex loss function being used as the surrogate of the 0-1 loss. Interestingly, for convex learning problems the ERM rule, of minimizing the empirical convex loss over a convex domain H, becomes a convex optimization problem, making an intimate connection between machine learning and mathematical optimization.
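For concreteness, the ERM rule described above and its convex relaxation can be written as follows (a standard formulation; the hinge loss is used here only as one possible surrogate, and the bounded-norm linear hypothesis class is an illustrative choice):

    ĥ_S = arg min_{h∈H} (1/n) ∑_{i=1}^n I[h(x_i) ≠ y_i]
        (ERM with the 0-1 loss; NP-hard in general)

    ŵ_S = arg min_{∥w∥≤R} (1/n) ∑_{i=1}^n max(0, 1 − y_i⟨w, x_i⟩)
        (ERM with a convex surrogate, here the hinge loss, over a convex domain; a convex optimization problem)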
Learnability in this setting departs from learnability via the ERM method and strongly depends on the characteristics of the convex domain, such as boundedness, and the analytical properties (curvature) of the loss function, such as Lipschitzness, smoothness (i.e., differentiability with Lipschitz gradients), and strong convexity (i.e., at any point one can find a convex quadratic lower bound for the function). Beyond learnability, the sample complexity of learning algorithms can also be characterized in terms of the analytical properties of the loss function. Therefore, smoothness and strong convexity of the convex surrogate loss function play a crucial role in characterizing learnability and in the analysis of the sample complexity of convex learning problems.

1.1.1 Smoothness and sample complexity

While the main focus of statistical learning theory has been on understanding learnability and sample complexity by investigating the complexity of the hypothesis class in terms of known combinatorial measures under the uniform convergence property, recent advances in online learning and optimization theory opened a new trend in understanding the generalization ability of learning algorithms in terms of the characteristics of the loss functions being used in convex learning problems. In particular, a staggering number of results have focused on strong convexity of the loss function and obtained better generalization bounds, which are referred to as fast rates. In terms of smoothness of the loss function, it has recently been shown that under this assumption it is possible to obtain optimistic rates (in the sense that smooth losses yield better generalization bounds when the problem is easier), which are more appealing than in the case where the convex surrogate loss is only Lipschitz continuous. This motivates us to take a step forward in this direction and investigate the smoothness of loss functions in more depth.

1.1.2 Smoothness and binary excess risk

As noted above, the convex surrogates of the 0-1 loss are highly preferred because of the computational and theoretical virtues that convexity brings in. Since the choice of convex surrogate could significantly affect the binary excess risk, the relation between risk bounds in terms of the binary 0-1 loss and its corresponding convex surrogate has been a focus of the learning community over the last decade. It has been shown that the binary excess risk can be upper bounded by the convex excess risk through a transform function that only depends on the surrogate convex loss.

Although a great deal of work has been devoted to understanding the relation between the binary excess risk and the convex excess risk, there remain a variety of open problems. In particular, this transformation is well understood under mild conditions such as convexity, but it is unclear how other properties of convex surrogates, such as smoothness, may affect this relation. This becomes more critical if we consider smooth surrogates, as witnessed by the fact that smoothness is further beneficial both computationally, by attaining an optimal convergence rate for the optimization error, and statistically, by providing an improved optimistic rate for the generalization bound. Given this positive news about using smooth convex surrogates, an open research question is how the smoothness of a convex surrogate will affect the binary excess risk. We are thus motivated to investigate the impact of the smoothness of a convex loss function on transforming the excess risk in terms of the convex surrogate loss into the binary excess risk.
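Collecting the curvature conditions referred to in this section (standard definitions; the formal treatment is given in Chapter 2), for a differentiable function f : R^d → R and all w, w′ in its domain:

    ρ-Lipschitz:          |f(w) − f(w′)| ≤ ρ∥w − w′∥
    β-smooth:             ∥∇f(w) − ∇f(w′)∥ ≤ β∥w − w′∥, equivalently f(w′) ≤ f(w) + ⟨∇f(w), w′ − w⟩ + (β/2)∥w′ − w∥²
    α-strongly convex:    f(w′) ≥ f(w) + ⟨∇f(w), w′ − w⟩ + (α/2)∥w′ − w∥²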
1.2 Sequential Prediction/Game Theoretic Learning

An alternative paradigm for analyzing learning problems is sequential or online learning, which can be phrased as a repeated two-player game between the learner and an adversary, making an intimate connection between learning and adversarial game theory. In the sequential prediction/online learning framework, the learner is faced with a sequence of samples appearing at discrete time intervals and is required to make predictions sequentially. In contrast to the statistical setting, in which the data source is typically assumed to be i.i.d. with an unknown distribution, in the online framework we relax or eliminate any stochastic assumptions imposed on the samples, and they might be chosen adversarially. As a result, the online learning framework is better suited for adversarial and interactive learning tasks such as spam email detection and stock market prediction, where decisions of the learner could negatively affect future instances the learner receives.

By dropping the statistical assumptions on the observed sequence, it is not immediately clear how the prediction problem can be made meaningful and which goals are reasonable. One popular possibility is to measure the performance of the learner by the loss he/she has accumulated during the learning process and compare it to the loss of the best fixed solution. The cumulative loss suffered on a sequence of rounds is the sum of the instantaneous losses suffered on each one of the rounds in the sequence. In particular, the goal becomes to minimize the gap between the cumulative loss of the online learner and the loss of a strategy that selects the best action fixed in hindsight. This performance gap is called the regret. The analysis of regret mainly focuses on investigating how the regret depends on the length of the time horizon over which the game proceeds. We note that the best fixed action is chosen from a comparator class of predictors against which the learner will be compared and can only be computed in full knowledge of the sequence of loss functions. Regret analysis stands in stark contrast to the statistical framework, in which the learner is evaluated based on his/her accuracy after seeing all training examples, making the online learning setting inherently harder. The theoretical utility of online learning has long been appreciated. More recently, it has become a mainstay of optimization, where it serves as a computational platform from which a variety of large-scale learning problems can be solved.

The analogue of statistical learnability in the online setting is referred to as Hannan consistency. A hypothesis class is learnable in the online setting, i.e., Hannan consistent, if for any sequence of samples there exists an algorithm which attains sub-linear regret in terms of the number of rounds the interaction proceeds. Interestingly, unlike statistical learning theory, the analysis of online learning is mostly algorithmic, where efficient algorithms are proposed to solve the learning problem and their performance is analyzed to guarantee Hannan consistency.

Recently, tools from convex optimization made it possible to capture many online learning problems under a generic problem template, and in many circumstances to obtain improved regret bounds. This unified framework, which is referred to as Online Convex Optimization, assumes that the learner is forced to make decisions from a convex set and the adversary is supposed to play convex functions.
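In the online convex optimization template just described, with decisions w_t chosen from a convex set W and convex loss functions f_t played by the adversary, the regret after T rounds is the quantity

    Regret_T = ∑_{t=1}^T f_t(w_t) − min_{w∈W} ∑_{t=1}^T f_t(w),

and Hannan consistency requires Regret_T = o(T), i.e., the average per-round regret vanishes as T grows.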
Additionally, it has been demonstrated that the curvature of the convex loss functions played by the adversary, such as strong convexity, gives a great advantage to the player in attaining better regret bounds. Surprisingly, tools from online optimization have also provided insights for obtaining better convergence rates or more efficient algorithms for some stochastic and deterministic convex optimization problems.

1.2.1 Smoothness and regret bounds

Unlike strong convexity, the smoothness of loss functions is not a desirable property in the online setting, as it yields the same regret bounds as for loss functions that are merely Lipschitz continuous. However, there are scenarios in which the smoothness of the sequence of loss functions played by the adversary becomes important. One such scenario is online learning from loss functions that might exhibit some pattern and are not fully adversarial. For example, the weather condition or the stock price at one moment may have some correlation with the next, and their difference is usually small, while abrupt changes only occur sporadically. Therefore, devising online convex optimization algorithms which can take into account the gradual behavior of the environment and at the same time protect against worst-case sequences would be more desirable. In terms of regret analysis, this translates to having algorithms with regret bounded in terms of the variation of the loss functions instead of the time horizon, which is the main measure in the standard setting of sequential prediction. In these evolving settings, the smoothness of the loss function becomes critical. More importantly, no gradual variation bound is achievable if the loss functions are no longer smooth. This necessitates the development of online methods that exploit the smoothness assumption in the learning process or in the analysis to obtain improved regret bounds in terms of the variation of the sequence of loss functions, and it underlines our motivation in this thesis.

1.3 Convex Optimization and Learning

In the problem of convex optimization, we are interested in minimizing a given convex function f : R^d → R, from a predefined family of convex functions F, over a convex set W ⊆ R^d. The goal is to find an approximate solution with an accuracy ϵ, i.e., finding a w ∈ W where f(w) − min_{w∈W} f(w) ≤ ϵ. A typical optimization algorithm initially chooses a point from the feasible convex set, w_0 ∈ W, and iteratively updates these points based on some information about the function at hand until it achieves the desired accuracy. To capture the efficiency of an optimization procedure, we follow the black-box model of optimization. (As indicated by Yurii Nesterov in his seminal book [121], in general, optimization problems are unsolvable and we need to relax the goal to make it reachable.) In this model we assume that there exists an oracle which provides information about the query points, such as the function value, the gradient, and the second derivative (i.e., the Hessian). The number of queries issued to the oracle to find a solution with a predefined level of accuracy is called the oracle complexity when it is stated in terms of the desired accuracy ϵ, or equivalently the convergence rate when it is stated in terms of the number of queries.

As already mentioned, learning problems under both the statistical and online learning frameworks can be directly formulated as optimization problems.
In the statistical setting, and especially for convex learning problems, the learning algorithm corresponds to the optimization algorithm that solves the minimization problem of picking, from the set of hypotheses, a hypothesis that minimizes the empirical loss over the training sample. Similarly, in online convex optimization, the online learner iteratively chooses decisions from a closed, bounded, and non-empty convex set and encounters convex cost functions. Formulating and investigating both statistical and online learning problems in the context of convex optimization makes an intimate connection between learning and mathematical optimization. Therefore, the study of fast iterative methods for approximately solving convex programming problems is a central focus of research in convex optimization, with important applications in machine learning and many other areas of computer science.

The usefulness of convex optimization in the development of various learning algorithms has been well established in the past several years. Additionally, challenges in machine learning applications demand the development of new optimization algorithms. In optimization for supervised machine learning, and in particular the empirical risk minimization paradigm with convex surrogates and gradient information, there exist two regimes in which popular algorithms tend to operate: the deterministic regime (also known as batch optimization or the full gradient method), in which the whole training data set is used to compute the gradient at each iteration, and the stochastic regime, which samples a fixed number of training examples per iteration, typically a single training example, to compute the gradient at each iteration. Although stochastic optimization methods suffer from a lower convergence rate in comparison to batch methods, the lightweight computation per iteration makes them attractive for many large-scale learning problems. Hence, with the increasing amount of data that is available for training, stochastic convex optimization has emerged as the most scalable approach for large-scale machine learning, and it is known to yield moderately accurate solutions in a relatively short time. We emphasize that the role of convex optimization goes beyond computational issues; it also provides tools to characterize learnability in convex learning problems via efficient stochastic optimization algorithms for learning these problems.

Analogous to both the statistical and online learning frameworks, the curvature of the function to be optimized significantly affects the convergence rate of optimization methods. Perhaps the most extensively studied properties are strong convexity and smoothness of the function.

1.3.1 Smoothness and convergence rate

Exploiting the smoothness of the loss function, in particular in stochastic optimization, to obtain better convergence rates has been one of the main research challenges in recent years. Despite enormous advances in exploiting smoothness in deterministic optimization, it has not been utilized in stochastic optimization. In particular, stochastic optimization of smooth loss functions exhibits the same convergence rate as stochastic optimization under the Lipschitzness assumption alone. Therefore, this thesis is motivated by the need to develop stochastic optimization algorithms with better convergence rates under the smoothness assumption. The key question is whether or not the smoothness property of loss functions can be leveraged to develop much faster stochastic optimization methods.
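The two regimes described above differ only in how the gradient of the empirical risk is formed at each iteration. The Python sketch below is an illustration only (the least-squares loss, the synthetic data, the step sizes, and all function names are assumptions of this example): it contrasts a full gradient step, which touches all n training examples, with a stochastic gradient step, which touches a single randomly drawn example.

    import numpy as np

    rng = np.random.default_rng(0)

    # Empirical risk (1/n) * sum_i f_i(w) with f_i(w) = 0.5 * (<w, x_i> - y_i)^2,
    # a smooth convex loss used purely for illustration.
    X = rng.normal(size=(1000, 10))
    y = X @ np.ones(10) + 0.1 * rng.normal(size=1000)

    def full_gradient(w):
        # Deterministic (batch) regime: the gradient uses all n training examples.
        return X.T @ (X @ w - y) / len(y)

    def stochastic_gradient(w):
        # Stochastic regime: an unbiased gradient estimate from one random example.
        i = rng.integers(len(y))
        return (X[i] @ w - y[i]) * X[i]

    w_batch, w_sgd = np.zeros(10), np.zeros(10)
    for t in range(1, 201):
        w_batch -= 0.05 * full_gradient(w_batch)                    # one pass over the data per step
        w_sgd   -= 0.05 / np.sqrt(t) * stochastic_gradient(w_sgd)   # one example per step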
1.3.2 Smoothness and projection-free optimization

At the core of many iterative constrained optimization algorithms, in both online and stochastic convex optimization, is a projection step to ensure the feasibility of the solutions at intermediate iterations. This is a serious deficiency, since in many applications the projection onto the constrained domain might be computationally expensive and sometimes as hard as solving the original optimization problem. It is therefore of considerable interest to devise optimization methods which do not require projection steps or need only a bounded number of projection operations. As will become clear later in this thesis, by smoothing a strongly convex objective function, we are able to reduce the number of projections to a single projection at the end of the optimization process. In contrast to the other parts of the thesis, where we assume and exploit smoothness, this is the only result that injects and leverages smoothness in order to gain from its merits and devise more efficient algorithms.

1.4 Main Contributions

In this section we elaborate on the main problems considered in this thesis and our key contributions to addressing these problems. A common theme in all of the algorithms is that they exploit the smoothness of the loss function to obtain more efficient methods.

1.4.1 Statistical Learning

• Logarithmic sample complexity for learning from smooth and strongly convex losses with target risk. The first problem we consider in this thesis has a statistical nature. In particular, we consider learning in the passive setting but with a slight modification. We assume that the target expected loss, also referred to as the target risk, is provided in advance to the learner as prior knowledge. Unlike most studies in learning theory that only incorporate prior knowledge into the generalization bounds, we are able to explicitly utilize the target risk in the learning process. Our analysis reveals a surprising result on the sample complexity of learning: by exploiting the target risk in the learning algorithm, we show that when the loss function is both smooth and strongly convex, the sample complexity reduces to O(log(1/ϵ)), an exponential improvement compared to the sample complexity O(1/ϵ) for learning with strongly convex loss functions. Unlike previous works on sample complexity, the proof of our result is constructive and is based on a computationally efficient stochastic optimization algorithm, which makes it practically interesting. The proposed ClippedSGD algorithm uses knowledge of the target risk to appropriately clip gradients obtained from a stochastic oracle. The clipping is beneficial because it reduces the variance in the stochastic gradients and makes it possible to reduce the sample complexity. This happens under the assumption that the loss function is smooth and strongly convex.

• Statistical consistency of smoothed hinge loss. The second problem we address in the statistical learning setting is to investigate the relation between the excess risk that can be achieved by minimizing the empirical binary risk and the excess risk of smooth convex surrogates. As mentioned earlier, convex surrogates of the 0-1 loss are highly preferred because of the computational and theoretical virtues that convexity brings in.
This is of more importance if we consider smooth surrogates, as witnessed by the fact that smoothness is further beneficial both computationally, by attaining an optimal convergence rate for optimization, and statistically, by providing an improved optimistic rate for the generalization bound. However, we investigate the smoothness property from the viewpoint of statistical consistency and show how it affects the binary excess risk for the smoothed hinge loss. In particular, we intend to answer the following fundamental questions: "How does the smoothness of the surrogate convex loss affect the binary excess risk? Considering the advantages of smooth losses in terms of optimization and generalization, is it beneficial or detrimental in terms of statistical consistency? Under what conditions on these three types of errors is it better to use smooth losses?"

We show that, in contrast to the optimization and generalization errors that favor the choice of a smooth surrogate loss, the smoothness of the loss function may deteriorate the binary excess risk. Motivated by this negative result, we provide a unified analysis that integrates the optimization error, the generalization bound, and the error in translating the convex excess risk into a binary excess risk when examining the impact of smoothness on the binary excess risk. We show that under favorable conditions an appropriate choice of smooth convex surrogate loss will result in a binary excess risk that is better than O(1/√n), which is unimprovable for general non-smooth Lipschitz losses.

1.4.2 Sequential Prediction/Online Learning

• Regret bounded by gradual variation for smooth online convex optimization. As our third problem, we study the online convex optimization problem under the assumption that even though the loss functions are arbitrary, there is a hidden pattern that can be exploited in the learning process. Therefore, an interesting question that inspires our work in the analysis of online learning algorithms is the following: "Can we have online algorithms that can take advantage of benign sequences and at the same time protect against adversarial sequences?"

To answer this question, we introduce the gradual variation, measured by the sum of the distances between every two consecutive loss functions, to assess the performance of online learning algorithms in gradually evolving environments such as stock prediction. We propose two novel algorithms, an Improved Follow the Regularized Leader (IFTRL) algorithm and an Online Mirror Prox (OMP) method, that achieve a regret bound which only scales as the square root of the gradual variation for linear and general smooth convex loss functions. To establish the main results, we discuss a lower bound for online gradient descent and a necessary condition on the smoothness of the cost functions for obtaining a gradual variation bound. For the closely related problem of prediction with expert advice, we show that an online algorithm modified from the multiplicative update algorithm can also achieve a similar regret bound for a different measure of deviation. Finally, for loss functions which are strongly convex, in applications such as the portfolio management problem, we show a regret which is only logarithmic in terms of the gradual variation. The gradual variation, in addition to its intrinsic interest as an extension of regret analysis, has several specific consequences.
First, since the gradual variation lower bounds the standard regret bound, devising an algorithm whose regret is bounded by the gradual variation also guarantees a small regret. Second, algorithms with regret bounded by the gradual variation are specifically designed to exploit small variation; therefore they can capture the correlation between loss functions, if it exists, and boost the performance.

• Gradual variation for composite online convex optimization. As an impossibility result for obtaining gradual variation bounds for general convex losses, we show that for non-smooth functions, when the only information presented to the learner is first-order information about the cost functions, it is impossible to obtain a regret bounded by the gradual variation. However, we show that a gradual variation bound is achievable for a special class of non-smooth functions that are composed of a smooth component and a non-smooth component. We consider two categories for the non-smooth component. In the first category, we assume that the non-smooth component is a fixed function and is relatively easy, such that the composite gradient mapping can be solved without much computational overhead compared to the plain gradient mapping. In the second category, we assume that the non-smooth component can be written as an explicit maximization structure. In general, we consider a time-varying non-smooth component, present a primal-dual prox method, and prove a min-max regret bound by gradual variation. When the non-smooth components are equal across all trials, the usual regret is bounded by the min-max bound plus a variation in the non-smooth component.

1.4.3 Stochastic Optimization

• Improved convergence rate for stochastic optimization of smooth losses. We then turn to exploiting smoothness in stochastic optimization. Recently, stochastic optimization methods have experienced a renaissance in the design of fast algorithms for large-scale learning problems. Unlike optimization methods based on full gradients, the smoothness assumption has not been exploited by most stochastic optimization methods. More importantly, for general Lipschitz continuous convex functions, simple stochastic optimization methods such as stochastic gradient descent exhibit the same convergence rate as for smooth functions, implying that smoothness of the loss function is essentially not very useful and cannot be exploited in stochastic optimization. Therefore, noting this significant gap between the convergence rates for optimizing smooth functions in stochastic and deterministic optimization, the natural question that arises is: "Can the smoothness property of a function be exploited to speed up the convergence rate of stochastic optimization of smooth functions?"

We provide an affirmative answer to this question. In particular, we propose a novel optimization paradigm which interpolates between stochastic and full gradient methods and is able to exploit the smoothness of loss functions in the optimization process to obtain faster rates. The results show an intricate interplay between stochastic and deterministic convex optimization. The MixedGrad algorithm we propose fits in the mixed optimization paradigm and is an alternation of deterministic and stochastic gradient steps, with different frequencies for each type of step. We show that it attains an O(1/T) convergence rate for smooth losses.

• Condition number independent accesses of the full gradient oracle for smooth and strongly convex optimization.
The optimal iteration complexity of gradient based algorithms for smooth and strongly convex objectives is O(√κ log(1/ϵ)), where κ is the condition number (the ratio of the smoothness parameter to the strong convexity parameter). Despite its linear convergence rate in terms of the target accuracy ϵ, in the case that the optimization problem is ill-conditioned we need to evaluate a large number of full gradients, which could be computationally expensive. Therefore, a natural question is: "Can we manage the dependency on the condition number and devise optimization methods independent of the condition number in accessing the full gradient oracle?"

We show that in the mixed optimization regime introduced in this thesis, we may also leverage the smoothness assumption of the loss functions to devise algorithms with iteration complexities that are independent of the condition number in accessing the full gradient oracle. We utilize the idea of mixed optimization in progressively reducing the variance of stochastic gradients to optimize smooth and strongly convex functions, and propose the Epoch Mixed Gradient Descent (EMGD) algorithm, which is independent of the condition number in accessing the full gradients. Similar to the MixedGrad algorithm, a distinctive step in EMGD is the mixed gradient descent, where we use a combination of the full gradient and the stochastic gradient to update the intermediate solutions. By performing a fixed number of mixed gradient descent steps, we are able to improve the suboptimality of the solution by a constant factor, and thus achieve a linear convergence rate. Theoretical analysis shows that EMGD is able to find an ϵ-optimal solution by computing O(log(1/ϵ)) full gradients and O(κ² log(1/ϵ)) stochastic gradients. We also provide experimental evidence complementing our theoretical results for classification problems on a few medium-sized data sets.

• Efficient projection-free online and stochastic convex optimization. Another problem we address in this thesis is efficient projection-free optimization methods for stochastic and online convex optimization. Our motivation stems from the observation that most gradient-based optimization algorithms require a projection onto the convex set W from which the decisions are made. While the projection is straightforward for simple shapes (e.g., the Euclidean ball), for arbitrary complex sets it is the main computational bottleneck and may be inefficient in practice. For instance, for many applications in machine learning, such as metric learning, the convex domain is the positive semidefinite cone, for which the projection step requires a full eigendecomposition. For many other problems, the projection step is itself an offline optimization problem and might be as hard as solving the original optimization problem. This observation immediately leads to the following question that inspires our work: "To what extent is it possible to reduce the number of expensive projection steps in online and stochastic optimization? Can we trade expensive projection steps off for other types of light computational operations?"

We consider this problem in two settings: stochastic optimization and online convex optimization. In the stochastic setting, we develop novel stochastic optimization algorithms that do not need intermediate projections. Instead, only one projection at the last iteration is needed to obtain a feasible solution in the given domain.
Our theoretical analysis shows that, with high probability, the proposed algorithms achieve an O(1/√T) convergence rate for general convex optimization and an O(ln T / T) rate for strongly convex optimization, under mild conditions on the domain and the objective function. The key insight which underlies the proposed projection-free algorithm for strongly convex functions is smoothing the objective function. This is in contrast to the other problems in this thesis, where we try to leverage the smoothness of the objective, while here we introduce smoothness to gain from its computational virtues in alleviating the projection steps.

In the online setting, we consider an alternative online convex optimization problem. Instead of requiring that decisions belong to a constrained convex domain for all rounds, we only require that the constraints, which define the convex set, be satisfied in the long run. By turning the problem into an online convex-concave optimization problem, we propose an efficient algorithm which achieves an O(√T) regret bound and an O(T^{3/4}) bound on the violation of the constraints. We then modify the algorithm in order to guarantee that the constraints are satisfied in the long run. This gain is achieved at the price of an O(T^{3/4}) regret bound. We also prove an impossibility result which shows that simple ideas, such as augmenting the objective function with penalized constraints, fail to solve the problem and result in a linear bound O(T) for either the regret or the violation of the constraints.

1.5 Thesis Overview

The remainder of this thesis is organized as follows. Chapter 2 lays out the foundation for the rest of the thesis. In particular, we provide a survey of some of the background material from statistical learning and sequential prediction/online learning theory, as well as convex optimization. It will become clear in this chapter that there exist deep connections between these three areas.

The first part of the thesis focuses on statistical learning, investigating the sample complexity of learning when the target risk is known to the learner and the consistency of the smoothed hinge loss. In Chapter 3 we focus on statistical learning with target risk under the assumption that the loss function is smooth and strongly convex. Chapter 4 investigates the consistency of the smoothed hinge loss and provides negative and positive results on transforming its excess risk into a binary excess risk.

The second part of the thesis is on sequential prediction/online learning and introduces the gradual variation measure to assess the performance of online convex optimization algorithms in gradually evolving environments. Chapter 5 discusses the necessity of smoothness for obtaining regret bounds in terms of gradual variation, followed by two efficient algorithms that obtain gradual variation bounds for smooth online convex optimization problems. The adaptation to other settings, such as the expert advice problem and strongly convex loss functions, is also discussed. The extension of the results to special composite loss functions with a smooth component is discussed in Chapter 6.

The third part of the thesis is devoted to devising efficient stochastic optimization algorithms by leveraging the smoothness of loss functions. We propose the mixed optimization paradigm for stochastic optimization in Chapter 7 and extend it to smooth and strongly convex losses in Chapter 8.
The stochastic optimization methods with bounded projections and online optimization with soft constraints are elaborated in Chapter 9. Finally, the appendix summarizes rather standard material on convex analysis and concentration inequalities that is used in the proofs of the results in the thesis and is mainly for reference. In order to facilitate independent reading of the various chapters, some of the definitions from convex analysis are repeated several times.

1.6 Bibliographic Notes

Some of the results in this dissertation have appeared in prior publications. The material in Chapter 3 is based on a work published in the Conference on Learning Theory (COLT) [102], and the content of Chapter 4 is new [108]. The material in Chapter 5 and Chapter 6 comes from [41], which was published at COLT and whose extended version has recently been published in the Machine Learning journal [150]. The results in Chapter 7 and Chapter 8 follow [107] and [152], respectively, which appeared in Advances in Neural Information Processing Systems (NIPS). The content of Chapter 9 is mostly compiled from [106], [104], and [105], which are published at NIPS and in the Journal of Machine Learning Research (JMLR).

Chapter 2 Preliminaries

The goal of this chapter is to give a gentle yet formal overview of the material related to the work done in this thesis. In particular, we will discuss key concepts and questions in statistical learning, online learning, and convex optimization, and will highlight the role of analytical properties of loss functions such as Lipschitzness, strong convexity, and smoothness in all of these settings. The exposition given here is necessarily brief, and detailed discussions will be provided in the relevant chapters.

2.1 Statistical Learning

2.1.1 Statistical Learning Model

In a typical supervised statistical learning problem (also known as passive learning or batch learning), we are given an instance space X and a space Y of labels or target values. Each element of the data domain X represents an object to be classified, e.g., the content of an email in a spam detection application or the features of an image in vision applications. The target space Y can be either discrete, Y = {−1, +1}, as in the case of classification, or continuous, Y = R, as in the case of regression. To model learning problems in a statistical or probabilistic setting, we assume that the product space Ξ ≡ X × Y is endowed with a probability measure D which is unknown to the learner. However, it is possible to sample an arbitrary number of pairs S = ((x_1, y_1), (x_2, y_2), · · · , (x_n, y_n)) ∈ Ξ^n according to the underlying distribution D. We term this set of examples the training set, or the training sample. The existence of the distribution D is necessary to ensure that the already collected samples S have something in common with the new and unseen data. A hypothesis or classifier h : X → Y is a function that assigns a label h(x) ∈ Y to any instance x ∈ X such that the assigned label is a good approximation of the possible response y to an arbitrary instance x generated according to the distribution D. In other words, the hypothesis h captures the functional relationship between the input instances and the output, which in turn makes it possible to predict the output value for future input instances. In light of the no free lunch theorem [148], learning is impossible unless we make assumptions regarding the nature of the problem at hand.
Therefore, when approaching a particular learning problem, it is desirable to take into account some prior knowledge we might have about the problem. In this regard, we assume that the learning algorithm is confined to a predetermined set of candidate hypotheses H = {h : X → Y} to which we wish to compare the result of our learning algorithm. In a specific learning context, the hypothesis class can represent our beliefs about the true nature of the classification rule for the problem. For example, the hypothesis class for binary classification might be a subset of a vector space with bounded norm representing the linear classifiers, i.e., H = {x → sign(⟨w, x⟩ + b) : w ∈ R^d, b ∈ R, ∥w∥ ≤ R}. In order to measure the performance of a learning algorithm, we usually use a loss function ℓ : H × Ξ → R+. The instantaneous loss incurred by a learning algorithm on instance z = (x, y) ∈ Ξ for picking hypothesis h ∈ H is given by ℓ(h, z). For example, in binary classification problems, Ξ = X × {−1, 1}, H is a set of functions h : X → {−1, 1}, and the loss function is the binary or 0-1 loss defined as ℓ_{0-1}(h(x), y) = I[h(x) ≠ y], where I[·] is the indicator function that takes value 1 if its argument is true and 0 otherwise. In regression problems, where the goal is to predict real-valued labels Y = R, the common loss function used to evaluate the performance of a regressor h on a sample z = (x, y) is the squared loss ℓ(h, z) = (h(x) − y)². A classifier is constructed on the basis of n independent and identically distributed (i.i.d.) samples S = ((x_1, y_1), (x_2, y_2), · · · , (x_n, y_n)) from Ξ. The ultimate goal of a typical learning algorithm is to pick a classifier h ∈ H that is competitive with the best hypothesis from H with respect to the expected risk or generalization error, defined as:

L_D(h) = E_{z∼D}[ℓ(h, z)],

where E[·] denotes the expectation with respect to the (unknown) probability distribution D underlying our samples, and ℓ(h, (x, y)) is the binary loss I[h(x) ≠ y] for classification and the squared loss (h(x) − y)² for regression. For binary classification, the generalization error of a hypothesis h : X → {−1, +1} is simply the probability that it predicts the wrong label on a randomly drawn instance from Ξ, i.e., L_D(h) = P_{(x,y)∼D}[h(x) ≠ y]. The difference between the risk of a particular classifier h and that of the optimal classifier h* = arg min_{h∈H} L_D(h) is called the excess risk of h, i.e., E(h) = L_D(h) − L_D(h*). In designing any typical solution to a supervised machine learning problem, there are a few key questions that must be considered. The first of these concerns approximation, which characterizes how rich the solution space H is in approximating the true underlying model. The second fundamental issue concerns estimation, which characterizes how well the obtained solution performs in making future predictions on unseen data and how many training samples suffice to find the solution. The third key question concerns computational efficiency, which characterizes how efficiently we can make use of the training data to choose an accurate hypothesis. The basic model for analyzing learning algorithms in computational learning theory is the Probably Approximately Correct (PAC) model proposed by the pioneering work of Valiant [144].
It applies to learning binary valued functions and uses the 0-1 loss under the realizability assumption, i.e., the algorithm receives samples that are consistent with a hypothesis in a fixed class H: ∃ h* ∈ H such that P_{(x,y)∼D}[h*(x) = y] = 1. In the PAC model we bound the loss of the algorithm with high probability over the random draw of samples. A decision-theoretic extension of the PAC framework, known as agnostic learning, was introduced by [87]; it generalizes the PAC model to general loss functions and drops the realizability assumption, as defined below:

Definition 2.1 (Agnostic PAC Learnability). A hypothesis class H is agnostic PAC learnable with respect to Ξ and a loss function ℓ : H × Ξ → R+ if there exist a function m_H : (0, 1)² → N and a learning algorithm A with the following property: for every ϵ, δ ∈ (0, 1) and for any distribution D over the domain Ξ, when running the algorithm A on m ≥ m_H(ϵ, δ) i.i.d. examples generated by D, the algorithm returns h ∈ H such that, with probability at least 1 − δ,

L_D(h) ≤ min_{h′∈H} L_D(h′) + ϵ.

If, furthermore, A runs in time poly(1/ϵ, 1/δ, n), then H is said to be efficiently agnostic PAC-learnable.

The goal of the PAC framework is to understand how large a data set needs to be in order to give good generalization. It also provides bounds for the computational cost of learning. In agnostic PAC learnability there are two fundamental questions that need to be addressed carefully: computational efficiency and sample complexity. The computational aspect of learning measures the amount of computation required to implement a learning algorithm. The sample complexity of an algorithm is the number of examples that suffices to ensure that, with probability at least 1 − δ (with respect to the random choice of S), the algorithm picks a hypothesis whose error is at most ϵ away from the optimal one. We note that while computational complexity concerns the efficiency of learning, the sample complexity is a statistical measure and concerns the difficulty of learning from the hypothesis class H with respect to the underlying distribution D. An equivalent way to present the sample complexity is to give a generalization bound. It states that, with probability at least 1 − δ, the gap between the attained risk L_D(h) and the optimal risk min_{h′∈H} L_D(h′) is upper bounded by some quantity that depends on the sample size n and on δ. The sample complexity of passive learning is well established and goes back to early works in learning theory, where the lower bounds Ω((1/ϵ)(log(1/ϵ) + log(1/δ))) and Ω((1/ϵ²)(log(1/ϵ) + log(1/δ))) were obtained in the classic PAC and the general agnostic PAC settings, respectively [52, 26, 7]. It is worth emphasizing that the PAC framework is a distribution-free model, and we are interested in sample-complexity guarantees that hold regardless of the distribution D from which the examples are drawn.

2.1.2 Empirical Risk Minimization

Since in the probabilistic setting we assume that there is an underlying probability distribution D over the sample space X × Y which captures the relationship between the samples given to the algorithm during training and the new instances it will receive in the future, the training examples must be 'representative' in some way of the examples to be seen in the future. Clearly, learning is hopeless if there is no correlation between past and present rounds.
Note, however, that the distribution D is not known to the learner; the learner sees the distribution only through the training examples S = (z_1, z_2, · · · , z_n) ∈ Ξ^n, and based on these examples must learn to predict well on new instances from the same distribution. Therefore, we cannot compute the generalization error directly, and machine learning aims to find estimators based on the observed data sample S. A simple and well-known learning approach is the empirical risk minimization (ERM) method. Basically, the idea of ERM is to replace the unknown true risk L_D(h) = P_{(x,y)∼D}[h(x) ≠ y] by its empirical counterpart rooted in the training set S and to minimize this empirical risk, defined as:

L_S(h) = (1/n) |{i : i ∈ [n] and h(x_i) ≠ y_i}|.

The empirical error L_S(h) of a hypothesis h ∈ H is its average error over the training samples in S, while the generalization error L_D(h) is its expected error on a random sample drawn from the distribution D. We note that the empirical error is a useful quantity, since it can easily be determined from the training data and provides a simple estimate of the true error. Minimizing the empirical loss over the training data yields a solution whose loss is close to the optimal loss if the class H is sufficiently large, so that the loss of the best function in H is close to the optimal loss, yet small enough so that finding the best candidate in H based on the data is computationally feasible. In this regard, generalization error bounds give an upper bound on the difference between the true and empirical error of functions in a given class, which holds with high probability with respect to the sampling of the training set. Then, our task becomes to evaluate the expected risk relying on the empirical error. Having this quantity bounded, a learning algorithm may choose the hypothesis that is the most accurate on the sample and be guaranteed that its loss on the distribution will also be low. Statistical learning theory is concerned with characterizing learnability and providing bounds on the deviation of this estimate from the expected error. One of its main achievements is a complete characterization of the necessary and sufficient conditions for generalization of ERM, and for its consistency. A fundamental answer, formally proven for supervised classification and regression, is that learnability is equivalent to uniform convergence, and that if a problem is learnable, it is learnable via empirical risk minimization.

Definition 2.2 (Uniform Convergence). A hypothesis class H has the uniform convergence property with respect to Ξ and the loss function ℓ : H × Ξ → R+ if there exists a function m_H : (0, 1)² → N such that, for any probability distribution D over Ξ and any sample S of size m_H(ϵ, δ) drawn i.i.d. from D, with probability at least 1 − δ, for all h ∈ H it holds that

|L_S(h) − L_D(h)| ≤ ϵ.

Hence, the crucial step towards proving learnability is to obtain a result on the uniform convergence of sample errors to true errors. Uniform convergence of empirical quantities to their means provides ways to bound the gap between the expected risk and the empirical risk by the complexity of the hypothesis set. Hence, the complexity of the hypothesis class H is the critical factor in determining the distribution-free sample complexity of a supervised learning problem.
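Before turning to complexity measures, the following is a small numerical illustration of the ERM rule just defined: from a finite pool of candidate linear classifiers, it selects the one with the smallest empirical 0-1 error on a training sample. The synthetic data and the candidate pool are placeholders and carry no specific meaning for the analysis in this thesis.

import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.normal(size=(n, d))                              # instances x_i
w_true = rng.normal(size=d)
y = np.sign(X @ w_true)                                  # labels y_i in {-1, +1}

candidates = [rng.normal(size=d) for _ in range(50)]     # a finite hypothesis class

def empirical_risk(w, X, y):
    """Empirical 0-1 risk L_S(h) = (1/n) |{i : sign(<w, x_i>) != y_i}|."""
    return np.mean(np.sign(X @ w) != y)

h_erm = min(candidates, key=lambda w: empirical_risk(w, X, y))
print("empirical risk of the ERM hypothesis:", empirical_risk(h_erm, X, y))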
Several complexity measures for hypothesis classes have been proposed, each providing a different type of guarantee, including the Vapnik-Chervonenkis (VC) dimension [146] and the Rademacher complexity [15, 91]. The main virtue of the Vapnik-Chervonenkis theorem and of Rademacher complexity is that they convert the problem of uniform deviations of empirical averages into a combinatorial and a data-dependent problem, respectively. We note that uniform convergence arguments are not the only possible way to characterize learnability. Since the first results of Vapnik and Chervonenkis on uniform laws of large numbers for classes of binary valued functions, there has been a considerable amount of work aiming at obtaining generalizations and refinements of these bounds. These techniques include sample compression [55], algorithmic stability [32], and PAC-Bayesian analysis [112], which have also been shown to characterize learnability and to prove generalization bounds. We will also discuss the stochastic optimization machinery [133] for characterizing learnability in general settings later in this chapter.

2.1.3 Surrogate Loss Functions and Statistical Consistency

Although the ERM approach has a lot of theoretical merit, since we seek to minimize the training error based on the 0-1 loss, it is typically a combinatorial problem, leading to an NP-hard optimization problem that is not computationally tractable. A common practice to circumvent this difficulty is to minimize a surrogate loss function, i.e., to replace the indicator function by a surrogate function and find the minimizer with respect to this surrogate function. Obviously, the surrogate loss needs to be computationally easy to optimize while remaining close, in some sense, to the 0-1 loss. In particular, if the surrogate function is assumed to be convex, the optimization can be performed efficiently with only modest computational resources. Examples of such surrogate loss functions for the 0-1 loss include the logistic loss ℓ_log(h, (x, y)) = log(1 + exp(−y h(x))) in logistic regression [61], the hinge loss ℓ_hinge(h, (x, y)) = max(0, 1 − y h(x)) in support vector machines (SVMs) [43], and the exponential loss ℓ_exp(h, (x, y)) = exp(−y h(x)) in boosting (e.g., AdaBoost [58]). When the hypothesis class H consists of functions that are linear in a parameter vector w, i.e., linear classifiers, these loss functions are depicted in Figure 2.1.

Figure 2.1: Illustrations of the 0-1 loss function and three surrogate convex loss functions: the hinge loss, the logistic loss, and the exponential loss, as scalar functions of y⟨w, x⟩.

Having defined the surrogate loss functions, the task is then to minimize the relaxed empirical loss in terms of the surrogate losses. In practice, however, the ubiquitous approach is regularized empirical risk minimization, which adds a regularization function R(h) to the objective and solves

h_S ∈ arg min_{h∈H} { L_S(h) + R(h) ≡ (1/n) ∑_{i=1}^{n} ℓ(h, z_i) + R(h) }.    (2.1)

The goal of introducing the regularizer is to prevent over-fitting. Of course, given some training data, it is always possible to build a function that fits the data exactly. But, in the presence of noise, this may lead to poor performance on unseen instances. An immediate consequence of adding the regularization term is to favor simpler classifiers and thereby increase generalization capability. Some of the commonly used regularizers in the literature are R(h) = ∥h∥²₂ and R(h) = ∥h∥₁.
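For concreteness, the sketch below evaluates the regularized empirical objective (2.1) for a linear classifier h_w(x) = ⟨w, x⟩ under the three surrogate losses above, with an ℓ2 regularizer; the synthetic data and the regularization weight are illustrative placeholders rather than recommended settings.

import numpy as np

def surrogate_losses(margin):
    """Surrogate losses as scalar functions of the margin y<w, x> (cf. Figure 2.1)."""
    return {
        "hinge":       np.maximum(0.0, 1.0 - margin),
        "logistic":    np.log1p(np.exp(-margin)),
        "exponential": np.exp(-margin),
    }

def regularized_empirical_risk(w, X, y, loss="hinge", reg=0.1):
    margins = y * (X @ w)                       # y_i <w, x_i> for every sample
    emp = np.mean(surrogate_losses(margins)[loss])
    return emp + reg * np.dot(w, w)             # L_S(w) + R(w) with R(w) = reg * ||w||^2

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = np.sign(rng.normal(size=100))
w = rng.normal(size=3)
print({name: regularized_empirical_risk(w, X, y, name) for name in ("hinge", "logistic", "exponential")})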
We note that the optimization problem in (2.1) is a convex optimization problem for which efficient algorithms exist to find a near-optimal solution in a reasonable amount of time. Although the idea of replacing the non-convex 0-1 loss function with convex surrogate loss functions seems appealing and resolves the efficiency issue of the ERM method, it has statistical consequences that must be balanced against the computational virtues of convexity. The question then is how well minimizing such a convex surrogate performs relative to minimizing the actual classification error. Statistical consistency concerns this issue. Consistency requires convergence of the empirical risk to the expected risk for the minimizer of the empirical risk, together with convergence of the expected risk to the minimum risk achievable by functions in H [14]. An important line of research in statistical learning theory has focused on relating the convex excess risk to the binary excess risk. It is known that, under mild conditions, the classifier learned by minimizing the empirical loss of a convex surrogate is consistent with the Bayes classifier [156, 101, 80, 97, 141, 14]. For instance, it was shown in [14] that the necessary and sufficient condition for a convex loss ℓ(·) to be consistent with the binary loss is that ℓ(·) is differentiable at the origin and ℓ′(0) < 0. It was further established in the same work that the binary excess risk can be upper bounded by the convex excess risk through a ψ-transform that depends on the surrogate convex loss ℓ(·). A detailed elaboration of this issue will be given in Chapter 4, where we examine the statistical consistency of smooth convex surrogates.

2.1.4 Convex Learning Problems

We now turn our attention to convex learning problems, where Ξ is an arbitrary measurable set, H is a closed, convex subset of a vector space, and the loss function ℓ(h, z) is convex with respect to its first argument. This family of learning problems encompasses a rich body of existing learning methods for which efficient algorithms exist, such as support vector machines, boosting, and logistic regression. Convex learning problems form an important family of learning problems, mainly because most of what we can learn efficiently falls into this family. Before diving into the formal definition, we need to familiarize ourselves with the following definitions from convex analysis [28, 121], which will come in handy throughout this dissertation (for standard definitions about convex analysis see the Appendix).

Definition 2.3 (Convexity). A set W in a vector space is convex if for any two vectors w, w′ ∈ W, the line segment connecting the two points is contained in W as well. In other words, for any λ ∈ [0, 1], we have λw + (1 − λ)w′ ∈ W. A function f : W → R is said to be convex if W is convex and for every w, w′ ∈ W and λ ∈ [0, 1],

f(λw + (1 − λ)w′) ≤ λ f(w) + (1 − λ) f(w′).

A continuously differentiable function is convex if f(w) ≥ f(w′) + ⟨∇f(w′), w − w′⟩ for all w, w′ ∈ W. If f is non-smooth, then this inequality holds with any sub-gradient g ∈ ∂f(w′) in place of ∇f(w′).

The formal definition of convex learning problems is given below.

Definition 2.4 (Convex Learning Problem). A learning problem with hypothesis space H, instance space Ξ = X × Y, and loss function ℓ : H × Ξ → R+ is said to be convex if the hypothesis class H is a parametrized convex set H = {h_w : x → ⟨w, x⟩ : w ∈ R^d, ∥w∥ ≤ R} and, for all z = (x, y) ∈ Ξ, the loss function ℓ(·, z) is a non-negative convex function.
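As a quick numerical illustration of Definitions 2.3 and 2.4, the sketch below checks the convexity inequality for the hinge loss ℓ(w, (x, y)) = max(0, 1 − y⟨w, x⟩), viewed as a function of the parameter w, at randomly drawn pairs of points; the instance (x, y) and the sampled points are placeholders.

import numpy as np

rng = np.random.default_rng(5)
d = 4
x, y = rng.normal(size=d), 1.0

def hinge(w):
    """Hinge loss of the linear hypothesis h_w on the fixed instance (x, y)."""
    return max(0.0, 1.0 - y * float(w @ x))

violations = 0
for _ in range(10_000):
    w1, w2 = rng.normal(size=d), rng.normal(size=d)
    lam = rng.uniform()
    lhs = hinge(lam * w1 + (1 - lam) * w2)
    rhs = lam * hinge(w1) + (1 - lam) * hinge(w2)
    if lhs > rhs + 1e-12:              # convexity requires lhs <= rhs
        violations += 1
print("violations of the convexity inequality:", violations)   # expected: 0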
In the remainder of the thesis, when it is clear from the context, we will represent the hypothesis class by W and simply use the vector w to represent h_w, rather than working with the hypothesis h_w. We note that for convex learning problems, the ERM rule becomes a convex optimization problem which can be solved efficiently. This stands in sharp contrast to non-convex loss functions, such as the 0-1 loss, for which solving the ERM rule is computationally cumbersome and known to be NP-hard. Obviously, this efficiency comes at a price: not every convex learning problem is guaranteed to be learnable, and convexity by itself is not sufficient for learnability. This requires imposing more assumptions on the setting to ensure learnability of the problem. In particular, it can be shown that if the hypothesis space W is bounded and the loss function is Lipschitz or smooth, as formally defined below, then the convex learning problem is learnable [133, 130].

Definition 2.5 (Lipschitzness). A function f : W → R is ρ-Lipschitz over the set W if for every w, w′ ∈ W we have |f(w) − f(w′)| ≤ ρ∥w − w′∥.

Definition 2.6 (Smoothness). A differentiable loss function f : W → R is said to be β-smooth with respect to a norm ∥ · ∥ if it holds that

f(w) ≤ f(w′) + ⟨∇f(w′), w − w′⟩ + (β/2)∥w − w′∥²,  ∀ w, w′ ∈ W.    (2.2)

We note that smoothness also follows if the gradient of the loss function is β-Lipschitz, i.e., ∥∇f(w) − ∇f(w′)∥ ≤ β∥w − w′∥. Smooth functions arise, for instance, in logistic and least-squares regression, and in general when learning linear predictors where the loss function has a Lipschitz continuous gradient. There has been an upsurge of interest over the last decade in finding tight upper bounds on the sample complexity of convex learning problems by utilizing prior knowledge on the curvature of the loss function, which has led to stronger generalization bounds in the agnostic PAC setting. In [95] fast rates were obtained for the squared loss by exploiting the strong convexity of this loss function, which only holds under a pseudo-dimensionality assumption. With the recent developments in online strongly convex optimization [68], fast rates approaching O((1/ϵ) log(1/δ)) for convex Lipschitz strongly convex loss functions have been obtained in [83, 140, 82], and for exponentially concave loss functions in [103]. For smooth non-negative loss functions, [138] improved the sample complexity to optimistic rates for non-parametric learning using the notion of local Rademacher complexity [13].

2.2 Sequential Prediction/Online Learning

The statistical model discussed above first assumes the existence of a stochastic model generating instances according to the underlying distribution D, then samples a training set S = ((x_1, y_1), (x_2, y_2), · · · , (x_n, y_n)) and investigates the ERM strategy to find a hypothesis h ∈ H that generalizes well to unseen instances. Although this model is valid for cases in which a tractable statistical model reasonably describes the underlying process, it may be unrealistic in practical problems where the process is hard to model from a statistical viewpoint and may even react to the learner's decisions, e.g., applications such as portfolio management, computational finance, and weather prediction.
Sequential prediction/online learning (also known as universal prediction of individual sequences) is a strand of learning theory that avoids making any stochastic assumptions about the way the observations are generated; the goal is to develop prediction methods that are robust in the sense that they work well even in the worst case. For instance, in the problem of online portfolio management [18], an online investor wants to distribute her wealth over a set of available financial instruments without any assumption on the market outcome in advance. Obviously, in applications of this kind, the main challenge is that the learner cannot make any statistical assumption about the process generating the instances, and the data are continuously evolving or adversarially changing. Online learning or sequential prediction is an elegant paradigm for capturing these problems that alleviates the statistical assumptions usually made in the statistical setting (see, e.g., [38] and [129] for a thorough discussion).

2.2.1 Mistake Bound Model and Regret Analysis

The problem of sequential prediction may be cast as a repeated game between a decision maker, also called the forecaster, and an environment, also called the adversary. In this model, learning proceeds in T consecutive rounds, as we see examples one by one. At the beginning of round t, the learning algorithm A holds a hypothesis h_t ∈ H and the adversary picks an instance z_t = (x_t, y_t). The adversary at round t can select the instance z_t ∈ Ξ in an adversarial, worst-case fashion based on the previous instances z_1, . . . , z_{t−1} and on the previous hypotheses h_1, . . . , h_{t−1} selected by the learner. Then, the learner receives the instance x_t and predicts h_t(x_t). At the end of the round, the true label y_t is revealed to the learner, and A makes a mistake if h_t(x_t) ≠ y_t. Unlike the statistical setting, here the prediction task is sequential: the outcomes are revealed only one after another; at time t, the learner guesses the next outcome y_t before it is revealed. The algorithm then updates its hypothesis, if necessary, to h_{t+1}, and this continues until time T. The sequential prediction model discussed above, which is reminiscent of the framework of competitive analysis [27], is known as the mistake bound model and was introduced to the learning community in [98]. In this model the goal of the learner is to sequentially deduce information from previous rounds so as to improve its predictions on future rounds. An algorithm that is conservative (or lazy), meaning that it only changes its hypothesis when it makes a mistake, is called mistake driven. We note that many seemingly unrelated problems fit in the framework of this abstract sequential decision problem, including online prediction in the experts model [37], Perceptron-like classification algorithms [25], the Winnow algorithm [98], and learning in repeated game playing [59], to name a few. Here we list a few sequential prediction problems to better illustrate the setting.

Example 1: Online classification and regression. As an illustrative example, let us consider the online binary classification problem where at each round t the learner receives an instance x_t ∈ X as input and is required to predict a label ŷ_t ∈ Y = {−1, +1} for the input instance. Then, the learner receives the correct label y_t ∈ Y and suffers the 0-1 loss I[y_t ≠ ŷ_t], as sketched in the protocol below.
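The following is a minimal sketch of this protocol for a learner that maintains a linear hypothesis and performs a Perceptron-style update on mistakes; the update rule and the synthetic data stream are illustrative placeholders and not part of the analysis in this thesis.

import numpy as np

rng = np.random.default_rng(2)
d, T = 5, 1000
w_target = rng.normal(size=d)              # hidden rule used here to generate labels

w = np.zeros(d)                             # learner's current hypothesis h_t
mistakes = 0
for t in range(T):
    x_t = rng.normal(size=d)                # instance x_t is revealed
    y_hat = 1.0 if x_t @ w >= 0 else -1.0   # learner predicts h_t(x_t)
    y_t = 1.0 if x_t @ w_target >= 0 else -1.0   # true label y_t is revealed
    if y_hat != y_t:                        # 0-1 loss I[h_t(x_t) != y_t]
        mistakes += 1
        w = w + y_t * x_t                   # conservative (mistake-driven) update
print("number of mistakes over", T, "rounds:", mistakes)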
As another example, let us consider online regression problems where at each round t a feature vector x_t ∈ R^d is given to the online learner, and a value y_t ∈ R has to be estimated using linear predictors with bounded norm, i.e., H = {x → ⟨w, x⟩ : w ∈ R^d, ∥w∥ ≤ R}, so that the learner predicts ⟨w, x_t⟩. The loss function at round t for a predictor w is ℓ_t(w) = (y_t − ⟨w, x_t⟩)².

Example 2: Prediction with expert advice. Every day the manager of a company should decide to produce one of K different products without knowing the market demand in advance. At the end of the day, he is informed of the gain achieved by selling the chosen product but learns nothing about the potential income from the other products. The goal of the manager is to maximize the income of the company over a sequence of many periods. This is a problem of repeated decision making, called learning from expert advice, where the objective functions to be optimized are unknown and revealed (perhaps only partially) in an online manner. In the general prediction with expert advice game, a learner competes against K ∈ N experts in a game consisting of T rounds. In each round t, each expert reveals a prediction from Y = {0, 1}. The learner forms its own prediction by sampling an expert from h_t = w_t ∈ H ≡ ∆_K, where ∆_K is the set of probability distributions over the K experts (i.e., the simplex). The true outcome y_t is then revealed, and the learner and all of the experts receive a penalty depending on how well their predictions fit the revealed outcome. The aim of the learner in this game is to incur a cumulative loss over all rounds that is not much worse than that of the best expert.

One natural measure of the quality of learning in the mistake bound model of the sequential setting is the number of worst-case mistakes the learner makes. In particular, under the realizability assumption (i.e., where there exists a hypothesis in H which performs perfectly on the sequence), the learner's goal becomes to make a bounded number of mistakes, which is known as a mistake bound. The optimal mistake bound for a hypothesis class H is the minimum mistake bound over all learning algorithms A on the worst-case sequence of examples:

Mistake(A, H, Ξ, T) = min_{h_1,h_2,··· ,h_T ∈ H} max_{x_1,x_2,··· ,x_T ∈ X} ∑_{t=1}^{T} I[h_t(x_t) ≠ h(x_t)],    (2.3)

where h_1, h_2, · · · , h_T ∈ H is the sequence of hypotheses generated by A.

We say that a hypothesis class H is learnable in the online learning model if there exists an online learning algorithm A with a finite worst-case mistake bound, no matter how long the sequence of examples T is. We note that the mistake bound model, in comparison to the PAC model, is strong in the sense that it does not depend on any assumption about the instances. It is also remarkable that despite the inherent differences between the PAC and mistake bound frameworks, mistake bounds have corresponding risk bounds that are not worse, and sometimes better, than those obtainable with a direct statistical approach. In particular, by a simple reduction, it is straightforward to show that if an algorithm A learns a hypothesis class H in the mistake bound model, then A also learns H in the probably approximately correct model. We note that, due to an impossibility result by Cover [45], no online predictor that makes deterministic predictions can achieve a sub-linear regret universally for all sequences. To circumvent this obstacle, two typical solutions have been examined: randomization and convexification.
In the former, we allow the learner to make randomized predictions, making the algorithm unpredictable to the adversary; in the latter, we replace the non-convex 0-1 loss with a convex surrogate loss function (see, e.g., [24] and [129]). Similar to statistical learning, we can also generalize the online setting to the agnostic (a.k.a. non-realizable) setting, where there is no classifier in H which performs perfectly on the sequence. In this case an adversary can make the cumulative loss of our online learning algorithm arbitrarily large. To overcome this deficiency, the performance of the forecaster is compared to some notion of "how well it could have performed". In particular, the performance of the online learner is compared to that of the best single decision for the sequence, chosen in hindsight from the hypotheses in H. This brings us to the objective commonly known as regret, which is formally defined by:

Regret(A, H, Ξ, T) = ∑_{t=1}^{T} ℓ(h_t, (x_t, y_t)) − min_{h∈H} ∑_{t=1}^{T} ℓ(h, (x_t, y_t)),    (2.4)

where ℓ(h, (x, y)) is the loss function measuring the discrepancy between the prediction h(x) and the corresponding observed element, e.g., the 0-1 loss I[h(x) ≠ y] for binary classification. Regret measures the difference between the cumulative loss of the learner's strategy and the minimum possible loss had the sequence of loss functions been known in advance and the learner been able to choose the best fixed action in hindsight. In particular, we are interested in the rate of growth of Regret(A, H, Ξ, T) in terms of T. When this is sub-linear in the number of rounds, that is, o(T), we call the solution Hannan consistent [38], implying that the learner's average per-round loss approaches the average per-round loss of the best fixed action in hindsight. Notably, the performance bound must hold for any sequence of loss functions, in particular when the sequence is chosen adversarially.

Online learnability. In the online setting, the analogue of PAC learnability was addressed by Littlestone [98], who described a combinatorial characterization of the hypothesis classes that are learnable in the mistake bound model under the realizability assumption. The extension of these results to the agnostic online setting was addressed in [20]. Recall that, in the PAC model, the VC-dimension of a hypothesis class H characterizes the learnability of H if we ignore computational considerations. Moreover, the VC-dimension characterizes learnability in the agnostic PAC model as well. In the online setting, what is known as the Littlestone dimension plays the same role. Recently, the notion of Sequential Rademacher complexity has been introduced to characterize online learnability, playing a role similar to that of the Rademacher complexity in statistical learning theory [126].

2.2.2 Online Convex Optimization and Regret Bounds

The online convex optimization (OCO) framework generalizes many known online learning problems in the realm of sequential prediction and repeated game playing. Among these are online classification and sequential portfolio optimization. The unified setting of OCO was introduced in [63], and the exact term was used earlier in [158]. Since the introduction of OCO, there has been a dizzying number of extensions and variants, which are the focus of this section. Assume we are given a fixed convex set W and some set F of convex functions on W. In OCO, a decision maker is iteratively required to choose a decision w_t ∈ W.
After making the decision w_t at round t, a convex loss function f_t ∈ F is chosen by the adversary and the decision maker incurs the loss f_t(w_t). The loss function is chosen completely arbitrarily, possibly in an adversarial manner given the current and past decisions of the decision maker. Online linear optimization is a special case of OCO in which F is the set of linear functions, i.e., F = {w → ⟨f, w⟩ : f ∈ R^d}. The goal of online convex optimization is to come up with a sequence of solutions w_1, . . . , w_T that minimizes the regret, defined as the difference between the cost of the sequence of decisions accumulated up to trial T by the learner and the cost of the best fixed decision in hindsight, i.e.,

Regret(A, W, F, T) = ∑_{t=1}^{T} f_t(w_t) − min_{w∈W} ∑_{t=1}^{T} f_t(w).

Based on the type of feedback revealed to the learner by the adversary at the end of each iteration, we distinguish two types of OCO problems. In full-information OCO, after suffering the loss, the decision maker gains full knowledge of the function f_t(·). In the partial-information setting (also called bandit OCO), the decision maker only learns the value f_t(w_t) and does not gain any other knowledge about f_t(·). We also distinguish between oblivious and adaptive adversaries. In the oblivious or non-adaptive model, the adversary is assumed to know our algorithm and can pick the worst possible sequence of cost functions for it. However, this sequence must be fixed in advance before the game starts; during the game, the adversary receives no feedback about our chosen decisions. In the more powerful adaptive model, the adversary is assumed to know not only our algorithm but also the history of the game up to the current round. In other words, at the end of each round t, our decision w_t is revealed to the adversary, and the next cost function f_{t+1}(·) may depend arbitrarily on w_1, · · · , w_t. The design of algorithms for regret minimization in the OCO setting has recently been influenced by tools from convex optimization. It has long been known that special kinds of loss functions permit tighter regret bounds than others. The two most important families of loss functions that have been considered are convex Lipschitz functions and strongly convex functions. Before presenting the known results on regret bounds for different families of loss functions, we first need the definition of strong convexity:

Definition 2.7 (Strong Convexity). A loss function f : W → R is said to be α-strongly convex with respect to a norm ∥ · ∥ if there exists a constant α > 0 (often called the modulus of strong convexity) such that, for any λ ∈ [0, 1] and for all w, w′ ∈ W, it holds that

f(λw + (1 − λ)w′) ≤ λ f(w) + (1 − λ) f(w′) − (α/2) λ(1 − λ)∥w − w′∥².

If f(w) is twice differentiable, then an equivalent definition of strong convexity is ∇²f(w) ⪰ αI, which indicates that the smallest eigenvalue of the Hessian of f(w) is uniformly lower bounded by α everywhere. When f(w) is differentiable, strong convexity is equivalent to

f(w) ≥ f(w′) + ⟨∇f(w′), w − w′⟩ + (α/2)∥w − w′∥²,  ∀ w, w′ ∈ W.

Henceforth, we shall review a few algorithms for OCO and state their regret bounds under different assumptions on the curvature of the sequence of adversarial loss functions.

2.2.2.1 Online Gradient Descent

We start with the first and perhaps simplest online convex optimization algorithm, which nevertheless captures the spirit of the idea behind most of the existing methods.
This algorithm applies to the most general setting of online convex optimization and is referred to as Online Gradient Descent (OGD). The OGD method is rooted in the standard gradient descent algorithm and was introduced to the online setting by Zinkevich [158]. The OGD method starts with an arbitrary decision w_1 ∈ W and iteratively modifies it according to the cost functions encountered so far, as follows:

Online Gradient Descent (OGD)
Input: convex set W, step size η > 0
Initialize: w_1 ∈ W
for t = 1, 2, . . . , T
    Play w_{t+1} = Π_W(w_t − η∇f_t(w_t))
    Receive loss function f_{t+1}(·) and incur f_{t+1}(w_{t+1})
end for

Here Π_W(·) denotes the orthogonal projection onto the convex set W. The OGD algorithm is straightforward to implement, and updates take time O(d) given the gradient. However, the projection step might be computationally cumbersome for complex domains W. The following theorem states that the regret bound of the OGD method, when applied to convex Lipschitz functions, is of order O(√T), which has been proven to be tight up to constant factors [1].

Theorem 2.8. Let f_1, f_2, . . . , f_T be an arbitrary sequence of convex, differentiable functions defined over the convex set W ⊆ R^d. Let G be an upper bound on the gradient norms, i.e., ∥∇f_t(w)∥ ≤ G for all t ∈ [T] and w ∈ W, and let R = max_{w,w′∈W} ∥w − w′∥. Then OGD with step size η_t = R/(G√t) achieves the following guarantee, for any w ∈ W and all T ≥ 1:

∑_{t=1}^{T} f_t(w_t) − ∑_{t=1}^{T} f_t(w) ≤ ∥w − w_1∥²/(2η) + (η/2) ∑_{t=1}^{T} ∥∇f_t(w_t)∥² = O(RG√T).    (2.5)

In fact it is possible to show that OGD can attain a logarithmic regret O(log T) for strongly convex functions by appropriately tuning the step sizes, as stated below.

Theorem 2.9. Let f_1, f_2, . . . , f_T : W → R be an arbitrary sequence of α-strongly convex functions. Under the same conditions as Theorem 2.8, with step size η_t = 1/(αt), the OGD algorithm achieves the following regret:

∑_{t=1}^{T} f_t(w_t) − ∑_{t=1}^{T} f_t(w) ≤ (G²/2) ∑_{t=1}^{T} 1/(αt) ≤ (G²/(2α))(1 + log T) = O(log T).    (2.6)

Remark 2.10. In the above algorithm, the updating rule of OGD uses the gradients ∇f_t(w) of the loss functions at each iteration to update the solution. In fact it is not required to assume that the loss functions are differentiable; it suffices to assume that the loss functions have a sub-gradient everywhere in the domain W, making the algorithm suitable for non-smooth settings. In particular, for any g_t ∈ ∂f_t(w_t), where ∂f_t(w_t) is the set of sub-gradients at the point w_t, the algorithm achieves the same regret bounds in both cases.

2.2.2.2 Follow The Perturbed Leader

The first efficient algorithm for general online linear optimization problems is due to Hannan [65], and it was subsequently rediscovered and clarified in [86]. The Follow The Perturbed Leader (FTPL) algorithm assumes that there is an oracle that can efficiently solve the offline optimization problem. Having access to such an oracle, FTPL selects the decision that appears to be the best so far, but for a version of the actual cost vectors that has been perturbed by the addition of some noise. The addition of random noise to the observed cost functions has the effect of slowing down the algorithm: instead of tracking small fluctuations in the cost functions, it tends to stick to the same decision unless there is a compelling reason to switch to another decision. Although the FTPL algorithm works in a setting similar to OGD, there is a crucial difference between them which makes FTPL more suitable for online combinatorial learning.
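Returning briefly to OGD before describing this difference, the following is a minimal sketch of the OGD update above for the Euclidean-ball domain, where the projection Π_W reduces to a simple rescaling; the quadratic losses, the slowly varying targets, and the step size schedule are illustrative placeholders.

import numpy as np

def project_ball(w, radius):
    """Orthogonal projection onto the Euclidean ball of the given radius."""
    norm = np.linalg.norm(w)
    return w if norm <= radius else w * (radius / norm)

rng = np.random.default_rng(3)
d, T, radius = 5, 500, 1.0
w = np.zeros(d)                                  # w_1
losses = []
for t in range(1, T + 1):
    x_t = rng.normal(size=d)                     # adversary's round-t data (placeholder)
    y_t = np.sin(t / 50.0)                       # slowly varying target (placeholder)
    losses.append((w @ x_t - y_t) ** 2)          # incur f_t(w_t)
    grad = 2.0 * (w @ x_t - y_t) * x_t           # gradient of f_t at w_t
    eta_t = 1.0 / np.sqrt(t)                     # step size of order 1/sqrt(t)
    w = project_ball(w - eta_t * grad, radius)   # w_{t+1} = Pi_W(w_t - eta_t * grad)
print("average per-round loss:", np.mean(losses))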
In particular, the decision set W does not need to be convex as long as the offline optimization problem can be solved efficiently. This is a significant advantage of the FTPL approach, which can be utilized to tackle more general problems with discrete decision spaces. The FTPL method for online decision making relies on a linear optimization procedure M over the set W that computes M(f) = arg min_{w∈W} ⟨f, w⟩ for any f ∈ R^d. FTPL then chooses w_{t+1} by first drawing a perturbation µ_t ∈ [0, 1/η]^d uniformly at random and computing w_{t+1} = M(∑_{τ=1}^{t} f_τ + µ_t).

Follow The Perturbed Leader (FTPL)
Input: a general domain W, step size η > 0, offline linear oracle M over W
Initialize: w_1 = M(0)
for t = 1, 2, . . . , T
    Draw µ_t ∈ [0, 1/η]^d uniformly at random
    Play w_{t+1} = M(∑_{τ=1}^{t} f_τ + µ_t)
    Receive loss vector f_{t+1} ∈ R^d and incur ⟨f_{t+1}, w_{t+1}⟩
end for

The regret of the FTPL algorithm is stated in the following theorem.

Theorem 2.11 (Regret Bound for FTPL). Let f_1, . . . , f_T be an arbitrary sequence of linear loss functions from the unit ball and let w_1, . . . , w_T be the sequence of decisions generated by FTPL constrained to a general set W. Then for any w ∈ W, the FTPL algorithm with parameter η = √(R/(GT)) satisfies

E[∑_{t=1}^{T} ⟨f_t, w_t⟩ − ∑_{t=1}^{T} ⟨f_t, w⟩] ≤ 2√(RGT),

where G, with max_{w} |⟨f_t, w⟩| ≤ G for all t ∈ [T], is an upper bound on the magnitude of the rewards, and R is an upper bound on the ℓ1 diameter of W, i.e., R ≥ max_{w,w′∈W} ∥w − w′∥₁.

2.2.2.3 Follow The Regularized Leader

A natural modification of the basic FTPL algorithm, or of fictitious play in game theory, is the Follow The Regularized Leader (FTRL) algorithm, in which we minimize the loss on all past rounds plus a regularization term. Regularization is an alternative to perturbation for stabilizing the decisions during prediction. Naturally, different regularization functions yield different algorithms for different applications. In FTRL we assume that the loss functions are linear, f_t(w) = ⟨f_t, w⟩ with f_t ∈ R^d; the generalization to convex loss functions can be accomplished by a linearization strategy. This reduction is shown in Figure 2.2.

Figure 2.2: Reduction of the general online convex optimization problem to online optimization with linear functions.

The main idea behind the linearization trick is that if an algorithm A is able to achieve good regret against linear loss functions, then A can be used to achieve good regret against sequences of convex loss functions as well. To see this, note that from the definition of convexity, i.e., f_t(w) ≥ f_t(w_t) + ⟨∇f_t(w_t), w − w_t⟩, we can feed the learner A with the linear loss functions f_t′(w) = ⟨∇f_t(w_t), w⟩ and then guarantee that f_t(w_t) − f_t(w) ≤ ⟨∇f_t(w_t), w_t − w⟩. Therefore, any bound on the regret for linear loss functions translates directly into a regret bound for convex losses. At each round t, the FTRL algorithm solves an offline optimization problem based on the sum of the loss functions up to time t − 1 and a regularization function. Let R(w) be a strongly convex differentiable function. The detailed steps of FTRL are as follows [3]:

Follow the Regularized Leader (FTRL)
Input: convex set W, step size η > 0, regularization R(w)
Initialize: w_1 ∈ arg min_{w∈W} R(w)
for t = 1, 2, . . . , T
    Play w_{t+1} = arg min_{w∈W} [ η ∑_{s=1}^{t} ⟨f_s, w⟩ + R(w) ]
    Receive loss function f_{t+1}(·) and incur f_{t+1}(w_{t+1})
    Set f_{t+1} = ∇f_{t+1}(w_{t+1})
end for

It is noticeable that what FTRL implements at each iteration is the regularized empirical risk minimization over the previous trials, as we saw in statistical learning. In the online setting, the regularizer has the role of forcing consecutive solutions to stay close to each other, a role similar to the one the perturbation plays in the FTPL algorithm. Furthermore, different choices of the regularizer lead to different algorithms. For instance, in the simplest case, if we let R(w) = (1/2)∥w∥², then FTRL behaves the same as the OGD algorithm. It is not hard to prove a simple bound on the regret of the FTRL algorithm for a given strongly convex regularization function R(w) and learning rate η.

Theorem 2.12 (Regret Bound for FTRL). Let f_1, f_2, . . . , f_T ∈ R^d be an arbitrary sequence of linear loss functions over the convex set W ⊆ R^d. Let R : W → R be a 1-strongly convex function with respect to a norm ∥ · ∥. Let R = max_{w∈W} R(w) − R(w_1) and max_{t∈[T]} ∥f_t∥_* ≤ G. Then, for any w ∈ W, by setting η = √R/(G√T), the regret of FTRL is bounded by:

∑_{t=1}^{T} ⟨f_t, w_t⟩ − ∑_{t=1}^{T} ⟨f_t, w⟩ ≤ (R(w) − R(w_1))/η + η ∑_{t=1}^{T} ∥f_t∥²_* = O(G√(RT)).    (2.7)

2.2.2.4 Online Mirror Descent

Another algorithm for the OCO problem is the online version of the celebrated proximal point algorithm in offline convex optimization. As mentioned before, the implicit goal of the regularization used in the FTRL algorithm is to control by how much consecutive solutions differ from each other. The proximal point algorithm is designed with the explicit goal of keeping w_{t+1} as close as possible to w_t. The closeness of two solutions is measured by the Bregman divergence induced by a strongly convex Legendre function Φ(·) defined over a convex domain K. A Legendre function is a strictly convex function with continuous partial derivatives whose gradient blows up at the boundary of its domain (see the Appendix for a detailed discussion). We assume that W ∩ K ≠ ∅. The online proximal point method solves an optimization problem which expresses the trade-off between the distance from the old solution and the loss suffered on the current convex function:

w_{t+1} = arg min_{w∈W∩K} [ η_t f_t(w) + B_Φ(w, w_t) ],    (2.8)

where B_Φ(w, w_t) is the Bregman divergence induced by Φ(·) (Definition A.17). When the loss functions are linear, i.e., f_t(w) = ⟨f_t, w⟩ for some f_t ∈ R^d, or when one replaces the objective f_t(w) in (2.8) with its linearization, i.e., f_t(w) ≈ f_t(w_t) + ⟨w − w_t, ∇f_t(w_t)⟩, the proximal point method becomes the Online Mirror Descent (OMD) algorithm, as detailed below:

Online Mirror Descent (OMD)
Input: convex set W, step size η > 0, Legendre function Φ : K → R
Initialize: w_1 ∈ arg min_{w∈K} Φ(w)
for t = 1, 2, . . . , T
    Play w_{t+1} = arg min_{w∈W∩K} [ η_t ⟨w − w_t, ∇f_t(w_t)⟩ + B_Φ(w, w_t) ]
    Receive loss function f_{t+1}(·) and incur f_{t+1}(w_{t+1})
end for

Theorem 2.13 (Regret Bound for OMD). Let f_1, f_2, . . . , f_T be an arbitrary sequence of convex, differentiable functions defined over the convex set W ⊆ R^d. Let G = max_t ∥∇f_t(w_t)∥_* and R = max_{w,w′∈W} ∥w − w′∥. Let Φ : K → R be a Legendre function which is 1-strongly convex with respect to the norm ∥ · ∥, with W ∩ K ≠ ∅. Then the regret of OMD can be bounded by

∑_{t=1}^{T} f_t(w_t) − ∑_{t=1}^{T} f_t(w) ≤ B_Φ(w, w_1)/η + (η/2) ∑_{t=1}^{T} ∥∇f_t(w_t)∥²_* = O(RG√T),    (2.9)

where ∥ · ∥_* is the dual norm of ∥ · ∥.
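As a concrete instance, the sketch below is a hedged implementation of OMD on the probability simplex with the negative entropy as the Legendre function, for which the linearized update has the familiar multiplicative (exponentiated gradient) closed form; the synthetic loss vectors and the step size are illustrative placeholders.

import numpy as np

def omd_entropic(loss_vectors, eta):
    """OMD on the simplex with Phi(w) = sum_i w_i log w_i: w_{t+1,i} ∝ w_{t,i} exp(-eta g_{t,i})."""
    K = loss_vectors.shape[1]
    w = np.full(K, 1.0 / K)            # w_1 = argmin of Phi over the simplex (uniform)
    total_loss = 0.0
    for g in loss_vectors:             # g plays the role of grad f_t(w_t)
        total_loss += float(g @ w)     # incur the linear loss <g_t, w_t>
        w = w * np.exp(-eta * g)       # mirror step in the entropic geometry
        w = w / w.sum()                # Bregman projection back onto the simplex
    return total_loss, w

rng = np.random.default_rng(4)
T, K = 1000, 10
losses = rng.uniform(size=(T, K))
cum_loss, w_final = omd_entropic(losses, eta=np.sqrt(np.log(K) / T))
best_fixed = losses.sum(axis=0).min()
print("algorithm loss:", cum_loss, " best fixed expert:", best_fixed)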
We note that many classical online learning algorithms can be viewed as variants of OMD, generally either in the Euclidean geometry, such as the Perceptron algorithm and OGD, or in the simplex geometry, using an entropic distance generating function, such as Winnow [98] and the Online Exponentiated Gradient algorithm [90].

2.2.2.5 Online Newton Step

As mentioned in the analysis of the OGD algorithm, it attains a logarithmic regret O(log T) when the loss functions have bounded gradients and are strongly convex. Another case in which we can obtain logarithmic regret is that of exp-concave loss functions (i.e., functions f for which exp(−αf(w)) is concave for some α > 0). Exp-concavity is a weaker condition than bounded gradients together with strong convexity. The Online Newton Step (ONS) [68] is the adaptation of the Newton method for convex optimization to the online setting; it achieves logarithmic regret when run on exp-concave functions, which makes it more general than OGD.

Online Newton Step (ONS)
Input: convex set W, step size η > 0
Initialize: w_1 ∈ W
for t = 1, 2, . . . , T
    Play w_{t+1} = Π_W^{A_t}(w_t − η_t A_t^{−1} ∇f_t(w_t)), where A_t = ∑_{s=1}^{t} ∇f_s(w_s)∇f_s(w_s)^⊤
    Receive loss function f_{t+1}(·) and incur f_{t+1}(w_{t+1})
end for

Here Π_W^A(·) is the projection induced by a matrix A, i.e., Π_W^A(w) = arg min_{z∈W} (z − w)^⊤ A (z − w). We note that, compared to the previous algorithms, which only exploit first-order information about the loss functions, the analysis of the ONS method is based on second-order information, i.e., the second derivatives of the loss functions, whereas the implementation of ONS relies only on first-order information. The following result shows that the ONS method achieves a logarithmic regret for exp-concave functions, which is a stronger result than the performance of OGD for strongly convex losses.

Theorem 2.14 (Regret Bound for ONS). Let f_1, f_2, . . . , f_T be an arbitrary sequence of α-exp-concave functions defined over the convex set W ⊆ R^d with G = max_t ∥∇f_t(w_t)∥. Let R = max_{w,w′∈W} ∥w − w′∥. Then the ONS algorithm achieves the following regret bound for any w ∈ W:

∑_{t=1}^{T} f_t(w_t) − ∑_{t=1}^{T} f_t(w) ≤ (1/α + GR) d log T = O(d log T).    (2.10)

Although this result seems appealing, for some functions of interest, such as the logistic loss, the exp-concavity parameter α can be extremely small, making the constant 1/α in the bound exponentially large [113]. This exponential dependence on the diameter of the feasible set can make the bound worse than the O(√T) bound obtained by OGD.

2.2.3 Variational Regret Bounds

Most previous works, including those discussed above, considered the most general setting in which the loss functions could be arbitrary and possibly chosen in an adversarial way. However, the environments around us may not always be adversarial, and the loss functions may exhibit patterns which can be exploited to achieve a smaller regret. Consequently, it has been objected that requiring an algorithm to have small regret for all sequences leads to results that are too loose to be practically interesting; the bounds obtained for worst-case scenarios become pessimistic for regular sequences. Therefore, it would be desirable to develop algorithms that yield tighter bounds for more regular sequences, while still providing protection against worst-case sequences. To this end, we need to replace the number of rounds appearing in the regret bound with some other notion of performance.
In particular, this new measure should depend on the variation in the sequence of cost functions presented to the learner. Having such an algorithm guarantees that if the cost sequence has low variation, the algorithm will be able to perform better. One work along this direction is that of [69]. For the online linear optimization problem, in which each loss function is linear, f_t(w) = ⟨f_t, w⟩, and can be seen as a vector, they considered the case in which the loss functions have a small variation, defined as

Variation(A, W, F, T) = ∑_{t=1}^{T} ∥f_t − µ∥²₂,

where µ = (1/T) ∑_{t=1}^{T} f_t is the average of the loss functions. For this setting, they showed that a regret of O(√Variation) can be achieved, and they also have an analogous result for the prediction with expert advice problem. According to this definition, a small Variation(A, W, F, T) means that most of the loss functions center around some fixed loss vector µ. This models a stationary environment, in which all of the loss functions are produced according to some fixed distribution. The variation bound is defined in terms of the total difference between the individual linear cost vectors and their mean. In Chapter 5 of this thesis, we introduce another measure, called the gradual variation. Gradual variation is more general and applies to environments which may be evolving, but in a somewhat gradual way. For example, the weather conditions or the stock price at one moment may have some correlation with the next, and their difference is usually small, while abrupt changes only occur sporadically. Formally, the gradual variation of a sequence of loss functions is defined as:

GradualVariation(A, W, F, T) = ∑_{t=1}^{T−1} max_{w∈W} ∥∇f_{t+1}(w) − ∇f_t(w)∥²₂.

It is easy to verify that the gradual variation lower bounds the variation, and hence algorithms with regret bounded by the gradual variation are more adaptive to regular patterns than algorithms with variational regret bounds.

2.2.4 Bandit Online Convex Optimization

In bandit OCO, once the online learner commits to the decision w_t at round t, he does not have access to the function f_t(·) chosen by the adversary and instead only receives the scalar loss f_t(w_t) he suffers at the point w_t. In the optimization community, this problem is usually known as zeroth-order or derivative-free convex optimization, as we only have access to function values in solving the optimization problem [79, 136]. A simple approach to bandit OCO, which has been the main theme in most existing works, is to utilize a reduction to the full-information OCO setting. To do so, one needs to approximate the gradient of the loss functions at each iteration based on the observed scalar loss and feed it to the full-information algorithm. This reduction is illustrated in Figure 2.3. A simple idea for estimating the gradients was utilized in [54], where a modified gradient descent approach for bandit OCO was presented that attains an O(T^{3/4}) regret bound. The key idea of their algorithm is to compute a stochastic approximation of the gradient of the cost functions from a single point evaluation of the cost function. The main observation is that one can estimate the gradient of a function f(w) by taking a random vector u from the unit sphere S = {x ∈ R^d : ∥x∥ = 1} and scaling it by f(w + δu), i.e., ĝ = f(w + δu)u.

Figure 2.3: Reduction of bandit online convex optimization to online convex optimization with full information; the full-information algorithm plays from the shrunk domain (1 − ξ)W to ensure that the sampled points belong to the domain.
Then E[ĝ] is proportional to the gradient of a smoothed version of f(·) defined as f̂(w) = E_{v∼B}[f(w + δv)], where v is drawn uniformly at random from the unit ball B = {x ∈ R^d : ∥x∥ ≤ 1}. To ensure that the sampled query points belong to the domain W, OGD is run over the shrunk domain (1 − ξ)W, where we further assume that rB ⊆ W ⊆ RB.

Expected Online Gradient Descent (EOGD)
Input: convex set W, step size η > 0, parameters δ, r, and ξ
Initialize: z_1 = 0
for t = 1, 2, . . . , T
    Pick a unit vector u_t uniformly at random
    Play w_t = z_t + δu_t and observe f_t(w_t)
    Update z_{t+1} = Π_{(1−ξ)W}(z_t − η f_t(w_t) u_t)
end for

Theorem 2.15 (Regret Bound for EOGD). Let f_1, f_2, . . . , f_T be a sequence of convex, differentiable functions defined over the convex domain W ⊆ R^d with rB ⊆ W ⊆ RB. Let g_1, . . . , g_T be vector-valued random variables with E[g_t | w_t] = E[∇f_t(w_t)] and ∥g_t∥ ≤ G for some G > 0. Then, for η = R/(G√T), ξ = δ/r, and δ = Θ(T^{−1/4}) with a constant depending on R, d, G, and r,

E[∑_{t=1}^{T} f_t(w_t)] − min_{w∈W} ∑_{t=1}^{T} f_t(w) ≤ O(T^{3/4}).

This regret bound was later improved to O(T^{2/3}) [9] for online bandit linear optimization. More recently, [46] proposed an inefficient algorithm for online bandit linear optimization with the optimal regret bound O(poly(d)√T), based on a multi-armed bandit algorithm. The key disadvantage of [46] is that it is not computationally efficient. Abernethy, Hazan, and Rakhlin [2] presented an efficient randomized algorithm with an optimal regret bound O(poly(d)√T) that exploits the properties of self-concordant barrier regularization. For general online bandit convex optimization, [5] proposed optimal algorithms in a multi-point bandit setting, in which multiple points can be queried for their cost values. With multiple queries, they showed that the EOGD algorithm can attain an O(√T) expected regret bound. The key idea of multi-point bandit online convex optimization, proposed in [5], is to approximate the gradient using two function evaluations. More specifically, at each iteration t we randomly choose a unit direction u_t and measure the function values at the points w_t + δu_t and w_t − δu_t, i.e., f_t(w_t + δu_t) and f_t(w_t − δu_t), where δ > 0 is a small perturbation of order O(1/T). Given the two function evaluations, we approximate the gradient ∇f_t(w_t) by g_t = (d/(2δ))(f_t(w_t + δu_t) − f_t(w_t − δu_t)) u_t. The nice property of this sampling strategy is that the norm of the sampled gradient no longer depends on δ, i.e., ∥g_t∥ ≤ dG, and it yields the same regret bound as OGD.

2.2.5 From Regret to Risk Bounds

So far we have dealt with two different models for learning: statistical and sequential (online). The online setting stands in stark contrast to the statistical setting in a few respects. First, in statistical learning, or the batch model, there is a strict division between the training phase and the testing phase. In contrast, in the online model, training and testing occur together at the same time, since every example acts both as a test of what was learned in the past and as a training example for improving our predictions in the future. This requires the online learner to be adaptive to the environment. A second key difference relates to the generation of examples.
Statistical learning scenario follows the key assumption that the distribution over data points is fixed over time, both for training and test points, and samples are assumed to be drawn i.i.d. from an underlying distribution D. Furthermore, the goal is to learn a hypothesis with a small expected loss or generalization error. In contrast, in online setting there is no notion of generalization and algorithms are measured using a mistake bound model and the notion of regret, which are based on worst-case or adversarial assumption where the adversary deliberately trying to ruin the learners performance. Finally, we distinguish between the processing model of statistical and online learning settings. Online algorithms process one sample at a time and can thus be significantly more efficient both in time and space and more practical than batch algorithms, when processing modern data sets of several million or billion points. hence, these algorithms are more suitable for large scale learning. This stands in contrast to statistical learning algorithms such as ERM and it would be tempting to switch to online learning algorithm. Given the close relationship between these two settings and clear advantage of online 57 learning from computational viewpoint, a paramount question is ”whether or not algorithms developed in the sequential setting can be used for statistical learning with guaranteed generalization bound? ”. More precisely, can we devise algorithms that exhibits the desirable characteristics of online learning but also has good generalization properties. Since a regret bound holds for all sequences of training samples, it also holds for an i.i.d. sequence. What remains to be done is to extract a single hypothesis out of the sequence produced by the sequential method, and to convert the regret guarantee into a guarantee about generalization. Such a process has been dubbed an Online-to-Batch Conversion and foreshadows a key achievement, which is any online learning algorithm with sub linear regret can be converted into a batch algorithm. Here we introduce two methods to convert an online algorithm that attains low regret into a batch learning algorithm that attains low risk. Such online-to-batch conversions are interesting both from the practical and the theoretical perspectives [36, 48, 83]. Formally, let f1 , f2 , · · · , fT be an i.i.d sequence of loss functions, ft : W → R. In statistical setting one can think of each loss function ft (w) as ft (w) = ℓ(w, (xt , yt )) for a fixed loss function ℓ : W ×Ξ → R+ and a random instance (xt , yt ) ∈ Ξ sampled following the underlying distribution D. We feed these loss functions to an online learning algorithm A and assume that the online learner produces a sequence w1 , w2 , · · · , wT ∈ W of hypothesis. ˆ ∈ W with small generalization error. Here The goal is to construct a single hypothesis w we consider two solutions for this problem: randomized conversion and averaging. The simplest conversion scheme is to simply choose a random hypothesis uniformly at random from the sequence of hypothesis w1 , w2 , · · · , wT . At the first glance this idea seems naive, but it has few desirable properties. First, the average loss of the algorithm is an ˆ i.e., E[ℓ(w)] ˆ = (1/T ) unbiased estimate of the expected risk of w, 58 ∑T t=1 ft (wt ). Second, the conversion is applicable regardless of any convexity assumption. 
Finally, in expectation, the ˆ is upper bounded by the average per-round regret of online learner, i.e., excess loss of w [ ] E Regret(A, W, F, T ) ˆ − min LD (w) ≤ LD (w) . T w∈W An alternative solution which is only applicable to learning from convex loss functions ˆ = (1/T ) over convex hypothesis spaces, is to output the average solution w ∑T t=1 wt . This conversion is also enjoys the same properties as the randomized conversion with an additional important feature. That is, we can able to show high-probability bounds on the excess risk provided that loss functions are bounded. 2.3 Convex Optimization A generic convex optimization problem may be written as min f (w) subject to w ∈ W, where f : Rd → R chosen from a specific family of functions F is a proper convex function, and W ⊆ Rd is nonempty, compact, and convex set which is also called the constraint or feasible set. We denote by w∗ the optimal solution to above problem and assume that it exists, i.e., w∗ = arg minw∈W f (w). Ideally, the goal of an optimization algorithm is to compute the optimal solution, but almost always it is impossible to compute an exact w∗ in finite time. hence, we turn to find an ϵ-approximate solution. A solution w ∈ W is an ϵ sun-optimal if f (w) − minw′ ∈W f (w′ ) ≤ ϵ. For a given family F of convex functions over the feasible set W, our primary focus is to 59 determine the efficiency of an optimization procedure to produce sub-optimal solutions. To analyze the efficiency of convex optimization algorithm one typically follows the oracle model of optimization which lies in the heart of the complexity theory of convex optimization [119, 121]. 2.3.1 Oracle Complexity of Optimization A typical convex optimization procedure initially picks some point in the feasible convex set W and iteratively updates these points based on some local information about the function it calculates around these successive points. The method can decide which points to query at based on the results of earlier queries, and tries to use as few queries as possible to achieve its task. The crucial question we are interested to answer about a specific optimization problem is the number of queries the algorithm makes to find an ϵ-accurate solution. The oracle complexity is a general model to analyze the computational complexity of optimization algorithms. In the oracle model, there is an oracle O and an information set I. The oracle O is simply a function ψ : W → I that for any query point w ∈ W returns an output from I. The information set provided to the algorithms varies depending on the type of the oracle. In particular, a zero-order oracle returns f (w) for a given query w ∈ W, first-order oracle returns gradient I = {∇f (w)} (respectively a sub-gradient I = {g ∈ ∂f (w)} if the function is not differentiable), and a second-order oracle return the Hessian at the queried point. We also distinguish between noisy (or stochastic) and exact (or deterministic) oracle models. In the noisy oracle model, the information returned by the oracle are corrupted with zero-mean noise with bounded variance. The algorithm iteratively updates the solution based on the information accumulated in 60 Oracle Lipschitz Lipschitz & Strongly Deterministic Stochastic ρ ϵ2 ρ ϵ2 ρ2 αϵ ρ2 α2 ϵ Smooth β √ ϵ β ρ ϵ + ϵ2 Smooth & Strongly √ κ log αϵ ( ) √ 1 κ log βϵ + αϵ Table 2.1: Lower bound on the oracle complexity for stochastic/deterministic first-order optimization methods. 
Here ρ, α, and β are the Lipschitzness, strong convexity, and smoothness parameters, respectively. The parameter κ is the condition number of function and is defined as κ = β/α. previous iterations. In particular, in optimization with zero and first-order exact oracle model which is the main focus of large scale optimization methods, an optimization method updates the solution using wt = ϕt (w0 , . . . , wt−1 , ∇f (w0 ), . . . , ∇f (wt−1 ), f (w0 ), . . . , f (wt−1 )) where ϕt : × ∪ts=1 Is → W is updating mechanism utilized by the optimization algorithm at iteration t to determine the next query point wt . Roughly speaking, we measure complexity of an algorithm by the number of queries that it makes to a prescribed oracle for computing the final solution. Given a positive integer T corresponding to the number of iterations, the minimax oracle optimization error after T steps, over a set of functions F, is defined as follows: ( ) OracleComplexity(F, W, O, T ) = inf sup f (wT ) − inf f (w) . ψ f ∈F w∈W In other words, the minimax oracle complexity is the best possible rate of convergence (as a function of the number of queries) for the optimization error when one restricts to black-box procedures in order to guarantee delivering an ϵ-accurate solution to any function f ∈ F. A large body of literature is devoted to obtaining rates of convergence of specific pro- 61 cedures for various set of convex functions F of interest (essentially smooth/non-smooth, and strongly convex/non-strongly convex) and different types of oracles (essentially noisy or stochastic/deterministic or exact, zero order or derivative free, first order, and second order). The oracle complexity of first-order deterministic and stochastic oracle models are summarized in Table 2.1 for different family of loss functions elicited from [121] for deterministic and from [4, 119] for stochastic optimization. The algorithms which attain these lower bounds will be discussed later. 2.3.2 Deterministic Convex Optimization Here we briefly review the optimization algorithms in the first-order oracle model which are called gradient based methods for simplicity. More precisely, we assume that the only information the optimization methods can learn about the particular problem instance is the values and derivatives of these components (f (w), ∇f (w)) at query points w ∈ W. Recently, first-order methods have experienced a renaissance in the design of fast algorithms for largescale optimization problems. This is due the fact that although higher order methods such as interior point methods [118] have linear convergence rate, but this fast rate comes at the cost of more expensive iterations, typically requiring the solution of a system of linear equations in the input variables. Consequently, the cost of each iteration typically grows at least quadratically with the problem dimension, making interior point methods impractical for very-large-scale convex programs. The convergence rate of gradient based methods usually depends on the properties of the objective function to be optimized. When the objective function is strongly convex and smooth, it is well known that gradient descent methods can achieve a geometric convergence rate [33]. When the objective function is smooth but not strongly convex, the optimal 62 convergence rate of a gradient descent method is O(1/T 2 ), and is achieved by the Nesterov’s methods [93]. For the objective function which is strongly convex but not smooth, the convergence rate becomes O(1/T ) [134]. 
For general non-smooth objective functions, the optimal rate of any first-order method is O(1/√T). Although this rate is not improvable in general, recent studies improve it to O(1/T) by exploiting special structure of the objective function [123, 122]. In addition, several methods have been developed for composite optimization, where the objective function is written as a sum of a smooth and a non-smooth function [94, 93, 96]. The proofs of the results that follow can be found in [121] and in the referenced papers.

2.3.2.1 Gradient Descent Method

Perhaps the simplest and most intuitive algorithm for deterministic optimization is the gradient descent (GD) method, which was proposed by Cauchy in 1846 [35]. (Cauchy's original algorithm uses the steepest-descent direction with an exact line search, which converges slowly; a great deal of later work has addressed more effective step-size choices [64, 16].) To find a solution within the domain W that optimizes the given objective function f(w), GD computes the gradient of f(w) by querying a first-order deterministic oracle and updates the solution by moving it in the direction opposite to the gradient. To ensure that the solution stays within the domain W, GD projects the updated solution back onto W at every iteration.

Projected Gradient Descent (GD)
Input: convex set W, step size η > 0, function f ∈ F, first-order oracle O
Initialize: w_1 ∈ W
for t = 1, 2, . . . , T
  Query the oracle O at point w_t to get ∇f(w_t)
  Update w_{t+1} = Π_W(w_t − η∇f(w_t))
end for

Theorem 2.16 (Convergence Rate of GD). Let f ∈ F be a convex function defined over the convex domain W ⊆ R^d and let w_∗ = arg min_{w∈W} f(w) be the optimal solution. Then the GD algorithm satisfies:

• if f is ρ-Lipschitz, by setting η = R/(ρ√T) we have
  f( (1/T) ∑_{t=1}^{T} w_t ) − f(w_∗) ≤ ρ∥w_∗ − w_1∥ / √T ;

• if f is β-smooth, by setting η = 1/β we have
  f(w_T) − f(w_∗) ≤ 2β∥w_∗ − w_1∥² / T ;

• if f is β-smooth and α-strongly convex, with condition number κ = β/α, by setting η = 2/(α+β) we have
  f(w_T) − f(w_∗) ≤ (∥w_1 − w_∗∥²/2) ((κ−1)/(κ+1))^T .

By comparing the rates in Theorem 2.16 to the lower bounds in Table 2.1, one sees that GD attains the optimal bound only for Lipschitz functions. We also note that these bounds are independent of the dimension of the convex domain W, as long as the Euclidean norms of the solutions and gradients do not depend on the ambient dimension; this makes GD attractive for optimization in high dimensions. The computational bottleneck of projected GD is often the projection step, which is a convex optimization problem in its own right and may be expensive for many domains. In Chapter 9 we propose efficient optimization methods that do not require intermediate projection steps.

2.3.2.2 Accelerated Gradient Descent Method

The O(1/T) convergence rate of the GD method for smooth loss functions is far from the O(1/T²) rate permitted by the oracle complexity lower bound discussed before. Nesterov showed in 1983 that the convergence rate of GD can be improved without using anything more than gradient information at various points of the domain. Accelerated GD [120, 121, 123] closes the gap between the O(1/T) rate of GD and the oracle complexity lower bound with a simple twist of the GD method, and attains the optimal O(1/T²) convergence rate for minimizing smooth functions.
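Before turning to the accelerated scheme, the following minimal sketch illustrates the projected GD update of Section 2.3.2.1. The quadratic objective, the ℓ2-ball domain, and all names here are illustrative choices rather than part of the thesis; the step size η = 1/β matches the smooth case of Theorem 2.16.

```python
import numpy as np

def project_l2_ball(w, radius):
    """Euclidean projection onto the ball {w : ||w||_2 <= radius}."""
    norm = np.linalg.norm(w)
    return w if norm <= radius else w * (radius / norm)

def projected_gd(grad_f, w1, eta, T, radius):
    """Projected gradient descent: w_{t+1} = Pi_W(w_t - eta * grad f(w_t))."""
    w = w1.copy()
    iterates = [w.copy()]
    for _ in range(T):
        w = project_l2_ball(w - eta * grad_f(w), radius)
        iterates.append(w.copy())
    # last iterate (smooth case) and averaged iterate (Lipschitz case)
    return w, np.mean(iterates, axis=0)

# Illustrative smooth objective f(w) = 0.5 * ||A w - b||^2, whose smoothness
# parameter beta is the largest eigenvalue of A^T A.
rng = np.random.default_rng(0)
A, b = rng.standard_normal((50, 10)), rng.standard_normal(50)
beta = np.linalg.eigvalsh(A.T @ A).max()
grad_f = lambda w: A.T @ (A @ w - b)
w_T, w_avg = projected_gd(grad_f, np.zeros(10), eta=1.0 / beta, T=200, radius=5.0)
```

The sketch returns both the last iterate and the average of the iterates, mirroring the fact that the smooth and Lipschitz guarantees of Theorem 2.16 are stated for different output points.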
Accelerated Gradient Descent (AGD)
Input: function f ∈ F, first-order oracle O
Initialize: w_1 = z_1 = 0, λ_0 = 0
for t = 1, 2, . . . , T
  Query the oracle O at point w_t to get ∇f(w_t)
  Set λ_t = (1 + √(1 + 4λ_{t−1}²))/2 and γ_t = (1 − λ_t)/λ_{t+1}
  Update z_{t+1} = w_t − (1/β)∇f(w_t)
  Update w_{t+1} = (1 − γ_t)z_{t+1} + γ_t z_t
end for

The following theorem shows that AGD achieves an O(1/T²) convergence rate, which is tight.

Theorem 2.17 (Convergence Rate of AGD). Let f ∈ F be a convex and β-smooth function and let w_∗ be the optimal solution. Then accelerated gradient descent outputs a solution which satisfies:
  f(z_T) − f(w_∗) ≤ 2β∥w_1 − w_∗∥² / T².

2.3.2.3 Mirror Descent Method

Mirror Descent (MD) is a first-order optimization procedure which generalizes the classic GD method to non-Euclidean geometries by relying on a distance generating function specific to the geometry. The original MD algorithm was developed to perform gradient descent in spaces where the gradient only makes sense in the dual space. In these cases, MD first maps the point w_t into the dual space by the mapping Φ, then performs the gradient update in the dual space, and finally maps the resulting point back to the primal space. When the mapping is Φ(w) = (1/2)∥w∥², the primal and dual spaces coincide and MD reduces to plain gradient descent.

Mirror Descent (MD)
Input: η > 0, function f ∈ F, first-order oracle O
Initialize: w_1 = 0
for t = 1, 2, . . . , T
  Query the oracle O at point w_t to get ∇f(w_t)
  Update ∇Φ(z_{t+1}) = ∇Φ(w_t) − η∇f(w_t)
  Update w_{t+1} = argmin_{w∈W∩K} B_Φ(w, z_{t+1})
end for

Theorem 2.18 (Convergence Rate of MD). Let Φ be a mirror map and assume that Φ is α-strongly convex on W ∩ K with respect to ∥·∥. Let R² = sup_{w∈K∩W} Φ(w) − Φ(w_1) and let f be convex and ρ-Lipschitz w.r.t. ∥·∥. Then the MD algorithm with η = (R/ρ)√(2α/T) satisfies
  f( (1/T) ∑_{t=1}^{T} w_t ) − min_{w∈W} f(w) ≤ ρR √(2/(αT)).

The MD algorithm can alternatively be expressed as a nonlinear projected sub-gradient type method, derived from a general distance generating function (the Bregman divergence of Definition ??) instead of the usual Euclidean squared distance [17]:
  w_{t+1} = argmin_{w∈W} { ⟨w, ∇f(w_t)⟩ + (1/η) B_Φ(w, w_t) }.   (2.11)

Remark 2.19. In terms of convergence rate, MD obtains the same rate as the GD method, but it has the advantage of exploiting the geometry of the convex domain. More specifically, since MD adapts to the structure of the domain W via the mapping Φ, it has a milder dependence on the dimensionality of the domain, which is appealing for large scale optimization problems. As an example, it is easy to verify that for optimization over the simplex ∆ = {w ∈ R^d_{++} : ∑_i w_i = 1}, using the negative entropy Φ(w) = ∑_{i=1}^d w_i log w_i as the mapping function, the dependence of MD on d is of order log d, while the regular GD algorithm has a linear O(d) dependence.

2.3.2.4 Mirror Prox Method

In the black-box oracle model the algorithm has access to the values and gradients of the function without knowing the structure of the objective. But in many circumstances we never face a pure black-box model and do have some information about the structure of the underlying function. Interestingly, the proper use of the structure of the problem can help to obtain better convergence rates for specific families of loss functions [122, 123, 117].
In particular, in [117] it been shown that for non-smooth Lipschitz continuous functions which admit a smooth saddle-point representation one can obtain a rate of convergence of order O(1/T ) with a properly designed gradient descent method, despite the fact that the original function is √ non-smooth and can not be optimized with a convergence rate better then O(1/ T ) in black-box model. As an example consider the function f to be optimized is of the form f (w) = max1≤i≤n fi (w) where each individual functions fi (w), i ∈ [n] is convex, β-smooth and ρ-Lipschitz in some norm ∥ · ∥. In this case the function f (w) is not smooth and the √ best convergence rate one can hope in the black-box model is O(1/ T ). Let Φ : K → R be a mirror map on W and let w1 ∈ argminw∈W∩K Φ(w). The mirror 68 prox (extragradient in a specialized case) method is detailed below. Extra Gradient Descent Method (EGD) Input: η > 0, function f ∈ F , first-order oracle O Initialize: w1 = z1 = 0 for t = 1, 2, . . . , T Query the oracle O at point wt to get ∇f (wt ) Update ∇Φ(z′t+1 ) = ∇Φ(wt ) − η∇f (wt ) Update zt+1 ∈ argminz∈W∩K BΦ (z, z′t+1 ) and query the oracle to get ∇f (zt+1 ) ′ ) = ∇Φ(w ) − η∇f (z Update ∇Φ(wt+1 t t+1 ) ′ ) Update wt+1 ∈ argminw∈W∩K BΦ (w, wt+1 end for The EGD method first makes a step of MD to go from wt to zt+1 , and then it makes a similar step to obtain wt+1 , starting again from wt but this time using the gradient of f evaluated at zt+1 . The following theorem exhibits the rate of convergence for EGD algorithm. Theorem 2.20 (Convergence Rate of EGD). Let Φ be a α-strongly convex on K ∩ W with respect to ∥ · ∥. Let R = supw∈K∩W Φ(w) − Φ(w1 ) and f be convex and β-smooth w.r.t. ∥ · ∥. Then EGD with η = α β has a convergence rate as: ( ) T 1∑ βR2 . f zt − min f (w) ≤ T αT w∈W t=1 69 2.3.2.5 Conditional Gradient Descent Method The main computation bottleneck of gradient descent methods in solving constrained optimization problems is the projection step which might be as hard as solving the original optimization problem. Surprisingly the projection step can be avoided by replacing the expensive projection operation with other kinds of light computational operations. One such an example is the Conditional Gradient Descent (CGD) method which is also known as Frank-Wolf algorithm. The Frank-Wolfe method, that was originally introduced in a paper by Frank and Wolfe from the 1950 [56], where they aimed to present an algorithm for minimizing a quadratic function over a polytope using only linear optimization steps over the feasible set. The CGD algorithm proceeds by iteratively solving a linear optimization problem to find a direction pt inside the domain W and updating the solution as a linear combination of the obtained direction and previous solution. This procedure guarantees that the updated solutions remain inside the feasible domain W. This method replaces the projection step with a linear optimization problem over the constrained domain which is more efficient as long as the linear problem is easy to be solved. 70 Conditional Gradient Descent (CGD) Input: convex set W, η > 0, a smooth convex function f ∈ F Initialize: w1 ∈ W for t = 1, 2, . . . , T Find pt = arg minp∈W ⟨∇f (wt ), p⟩ Update wt+1 = (1 − ηt )wt + ηt pt end for The following result shows the convergence rate of CGD for smooth functions. Theorem 2.21 (Convergence Rate of CGD). Assume that f ∈ F be a β-smooth convex function with respect to some norm ∥ · ∥ defined over the convex domain W. 
Let R = sup_{w,w′∈W} ∥w − w′∥. Then, by setting η_t = 2/(t+1) in the CGD method, we have:
  f(w_T) − f(w_∗) ≤ 2βR² / (T+1).

In Chapter 9 we will show that by replacing the projection step with a gradient computation of the constraint function, it is possible to devise efficient stochastic optimization methods which only require a single projection at the final iteration.

2.3.3 Stochastic Convex Optimization

So far we have assumed that the optimization algorithm has access to a noiseless oracle. It is more realistic to consider noisy oracles, where one does not have access to exact objective function or gradient values, but rather to their noisy estimates (usually with zero mean and bounded variance). In particular, for a fixed closed convex subset W ⊂ R^d we consider the following optimization problem:

  min_{w∈W} f(w)  for  f(w) = E[F(w, ξ)] = ∫_Ξ F(w, ξ) dP(ξ),   (2.12)

where we assume that the expected value function f(w) is continuous and convex on W. We note that if the function F(w, ξ) is convex on W, then f(w) is also convex and the problem becomes a convex programming problem. The main difficulty in solving the stochastic optimization problem in (2.12) is that the multidimensional integral (expectation) cannot be computed with high accuracy [115], and in statistical learning problems we usually do not know the distribution P. There are two families of methods that address this issue: stochastic approximation (SA) and sample average approximation (SAA).

The main idea of the SAA approach to solving stochastic programs is as follows. A sample ξ_1, ξ_2, · · · , ξ_n of n realizations of the random vector in the objective is generated, and the stochastic objective is approximated by the sample average function. Then, a deterministic optimization algorithm is applied to the approximate function. We note that we cannot perform gradient descent on f(w) directly, as we would need to know the underlying distribution to compute a gradient of f(w).

In SA we assume that there is a stochastic oracle O which, for a given point (w, ξ) ∈ W × Ξ, returns an unbiased estimate of a subgradient of f(w); in other words, it returns g such that E[g] ∈ ∂f(w). Stochastic optimization methods thus take steps that are only in expectation along the negative gradient. Based on this oracle, a simple algorithm to optimize the objective is Stochastic Gradient Descent (SGD). SGD is in the same spirit as GD, but it replaces the true gradients with stochastic gradients when updating the solutions:

Stochastic Gradient Descent (SGD)
Input: convex set W, step size η > 0, function f ∈ F, stochastic first-order oracle O
Initialize: w_1 = 0
for t = 1, 2, . . . , T
  Query the stochastic oracle O at point w_t to get g_t where E[g_t] ∈ ∂f(w_t)
  Update w_{t+1} = Π_W(w_t − ηg_t)
end for
Return: ŵ = (1/T) ∑_{t=1}^{T} w_t

Under mild conditions on the stochastic gradients, namely
  E_{ξ_t}[g_t] = ∇f(w_t)  and  E_{ξ_t}[exp(∥g_t − ∇f(w_t)∥²_∗ / σ²)] ≤ exp(1),
one can show that the SGD algorithm converges to the optimal solution at rate O(1/√T) with high probability. It is also straightforward to generalize the mirror descent method to the stochastic setting by replacing the Euclidean distance in the SGD update with a Bregman divergence adapted to the specific domain.

Comparing the SGD method for stochastic optimization with the OGD method for regret minimization, we note that the two methods are closely related. Although they look similar algorithmically, there are important conceptual differences between SGD and OGD.
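The following minimal sketch instantiates the SGD procedure just described. The stochastic oracle here is simulated by adding zero-mean Gaussian noise to the exact gradient of a synthetic quadratic; all names, constants, and the ℓ2-ball domain are illustrative assumptions, not part of the thesis.

```python
import numpy as np

def project_l2_ball(w, radius):
    """Euclidean projection onto the ball {w : ||w||_2 <= radius}."""
    norm = np.linalg.norm(w)
    return w if norm <= radius else w * (radius / norm)

def sgd(stochastic_grad, w1, eta, T, radius):
    """Projected SGD with iterate averaging: returns w_hat = (1/T) * sum_t w_t."""
    w = w1.copy()
    avg = np.zeros_like(w)
    for _ in range(T):
        g = stochastic_grad(w)                     # E[g] is a (sub)gradient of f at w
        w = project_l2_ball(w - eta * g, radius)   # projected stochastic step
        avg += w
    return avg / T

# Illustrative stochastic oracle: exact gradient of f(w) = 0.5*||A w - b||^2
# corrupted by zero-mean Gaussian noise with standard deviation sigma.
rng = np.random.default_rng(1)
A, b = rng.standard_normal((50, 10)), rng.standard_normal(50)
sigma = 0.5
stochastic_grad = lambda w: A.T @ (A @ w - b) + sigma * rng.standard_normal(10)
w_hat = sgd(stochastic_grad, np.zeros(10), eta=0.01, T=5000, radius=5.0)
```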
We note that in stochastic optimization the goal is to generate a sequence of solutions which 73 quickly convergences to the minimum of a function defined as f (w) = E[F (w, ξ)], while is online learning the goal is to generate a sequence of solutions that accumulates a small loss during the learning measured in terms of regret. In other words SGD provides an incremental solution to a stochastic optimization problem and OGD provides a solution to adopt a sequence of adversarially generated loss functions. We note that regret minimization algorithms equipped with online to batch conversion schemas discussed before settle an efficient paradigm to solve general optimization problems, but sometimes it seems essential to go beyond this barrier to obtain optimal convergence rates in stochastic setting [72, 125]. Remark 2.22. It is remarkable that in stark contrast to deterministic optimization where the smoothness of objective function makes a significant improvement in terms of convergence rate (i.e., Theorem 2.16), in stochastic optimization the smoothness is not a desirable property as it yields the same convergence rate as the Lipschitz functions. In particular, as it has been shown in Appendix A.3, a tight analysis of stochastic mirror descent algorithm has 2 σR √ an O( βR T + T ) convergence rate for smooth objective functions, which is dominated by the √ slow O(1/ T ) rate unless the variance of stochastic gradients becomes zero σ = 0. As it will be discussed in Chapter 7, the mixed optimization paradigm we introduce in thesis is able to leverage the smoothness of objection function to attain an O(1/T ) rate by accessing the full gradient oracle log T times on top of the O(T ) accesses of the stochastic gradient oracle. 2.3.4 Convex Optimization for Learning Problems Formulating statistical learning tasks and in particular convex learning problems as a convex optimization problem makes an intimate connection between learning and mathematical optimization. Therefore, optimization methods play a central role in solving machine learning 74 problems and challenges exist in machine learning applications demand the development of new optimization algorithms. To see this, consider the typical problem of the supervised learning consisting of a input space Ξ = X × Y and a suitable set of hypotheses W for prediction such as the set of linear predictors, i.e., W = {x → ⟨w, x⟩ : w ∈ Rd }. Then, the learner is provided with a training sample S = ((x1 , y1 ), (x2 , y2 ), · · · , (xn , yn )) ∈ Ξn and is supposed to pick a hypothesis w ∈ W which minimizes appropriate empirical cost over the training sample based on a predefined surrogate loss function ℓ : W × Ξ → R+ . The last step of this learning process corresponds to an optimization algorithm that solves the minimization problem of picking that hypothesis from the set of hypotheses. As a result, convex optimization forms the backbone of many algorithms for statistical learning. This formulation includes support vector machine (SVM), support vector regression (SVR), Lasso, logistic regression, and ridge regression among many others as detailed below: • Hinge loss (Support Vector Machine)): n ∑ max(0, 1 − yi ⟨w, xi ⟩). i=1 • Logistic loss (Logistic Regression): min w∈W • Least-squares loss (Regression): min w∈W log(1 + exp(−yi ⟨w, xi ⟩)). i=1 n ∑ w∈W • Exponential loss (Boosting): min n ∑ (yi − ⟨w, xi ⟩)2 . i=1 n ∑ exp(−yi ⟨w, xi ⟩). i=1 The domain W in above formulations, captures the constrains on the classifier w. 
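For concreteness, the following sketch spells out these four surrogate losses and their (sub)gradients for a linear predictor ⟨w, x⟩. The function names are ours; labels are assumed to lie in {−1, +1} (real-valued for the squared loss).

```python
import numpy as np

# Per-example surrogate losses for a linear predictor <w, x>.
# Each function returns (loss value, (sub)gradient with respect to w).

def hinge(w, x, y):
    margin = y * np.dot(w, x)
    loss = max(0.0, 1.0 - margin)
    grad = -y * x if margin < 1.0 else np.zeros_like(w)   # a valid subgradient
    return loss, grad

def logistic(w, x, y):
    margin = y * np.dot(w, x)
    loss = np.log1p(np.exp(-margin))
    grad = -y * x / (1.0 + np.exp(margin))
    return loss, grad

def squared(w, x, y):
    residual = np.dot(w, x) - y
    return residual ** 2, 2.0 * residual * x

def exponential(w, x, y):
    margin = y * np.dot(w, x)
    loss = np.exp(-margin)
    return loss, -y * x * loss
```

The hinge loss is convex but non-smooth, while the logistic, squared, and exponential losses are smooth; this distinction is exactly what drives the different convergence rates discussed in the previous subsections.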
Commonly considered examples are the bounded Euclidean ball W = {w ∈ Rd : ∥w∥2 ≤ R}, bounded ℓ1 ball W = {w ∈ Rd : ∥w∥1 ≤ B} or the box W = {w ∈ Rd : ∥w∥∞ ≤ B}. We note that instead of moving the constraint into the W, by leveraging on the theory of 75 Lagrangian method in constrained optimization, one can simply move the constraint into the objective and solve the unconstrained optimization problem, i.e., W = Rd . To fully understand the application of convex optimization methods to solving machine learning problems, let us consider the following optimization problem: 1∑ ℓ(w, (xi , yi )) n n min LS (w) = w∈W (2.13) i=1 A preliminary approach for solving the optimization problem in (2.13) is the batch gradient descent (GD) algorithm. It starts with some initial point, and iteratively updates the solution using the equation wt+1 = ΠW (wt − η∇LS (wt )) where 1∑ ∂ℓ(w, (xi , yi ))xi . n n ∇LS (w) = i=1 The main shortcoming of GD method is its high cost in computing the full gradient ∇LS (wt ), i.e., O(n) gradient computations, when the number of training examples is large. Stochastic gradient descent (SGD) alleviates this limitation of GD by sampling one (or a small set of) examples and computing a stochastic (sub)gradient at each iteration based on the sampled examples. Since the computational cost of SGD per iteration is independent of the size of the data (i.e., n), it is usually appealing for large-scale learning and optimization [29, 115, 134]. Despite of their slow rate of convergence compared with the batch methods, stochastic optimization methods have shown to be very effective for large scale and online learning problems, both theoretically [115, 94] and empirically [134]. We note although a large number of iterations is usually needed to obtain a solution of desirable accuracy, the lightweight 76 computation per iteration makes SGD attractive for many large-scale learning problems. 2.3.5 From Stochastic Optimization to Convex Learning Theory As mentioned earlier, most of existing learning algorithms follow the framework of empirical risk minimizer or regularized ERM, which was developed to great extent by Vapnik and Chervonenkis [146]. Essentially, ERM methods use the empirical loss over S, i.e., 1∑ ℓ(w, (xi , yi )), n n LS (w) = i=1 as a criterion to pick a hypothesis. From optimization viewpoint, the ERM methods resembles the widely used Sample Average Approximation (SAA) method in the optimization community when the hypothesis space and the loss function are convex. If uniform convergence holds, then the empirical risk minimizer is consistent, i.e., the population risk of the ERM converges to the optimal population risk, and the problem is learnable using ERM. A rather different paradigm for risk minimization is stochastic optimization. Recall that the goal of learning is to approximately minimize the risk LD (w) = E(x,y)∼D [ℓ(w, (x, y))]. (2.14) However, since the distribution D is unknown to the learner, we can not utilize standard gradient methods to directly minimize the expected loss in (2.14). This is because we are not able to compute the gradient ∇LD (w) at a particular query point w. 
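Returning to the empirical objective (2.13), the following sketch contrasts one full-gradient step, which touches all n examples, with one stochastic step based on a single sampled example. The squared loss, the synthetic data, and the step sizes are illustrative assumptions only.

```python
import numpy as np

# Empirical objective (2.13) with squared loss: L_S(w) = (1/n) * sum_i (<w, x_i> - y_i)^2.
rng = np.random.default_rng(2)
n, d = 1000, 20
X, y = rng.standard_normal((n, d)), rng.standard_normal(n)

def batch_gd_step(w, eta):
    """One batch GD step: forming the full gradient costs O(n) gradient evaluations."""
    grad = (2.0 / n) * X.T @ (X @ w - y)
    return w - eta * grad

def sgd_step(w, eta):
    """One SGD step: a single sampled example, so the per-iteration cost is independent of n."""
    i = rng.integers(n)
    grad = 2.0 * (X[i] @ w - y[i]) * X[i]
    return w - eta * grad

w_batch, w_stoch = np.zeros(d), np.zeros(d)
for _ in range(100):
    w_batch = batch_gd_step(w_batch, eta=1e-3)
    w_stoch = sgd_step(w_stoch, eta=1e-3)
```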
We note that this is different from the application of SGD for solving the optimization problem in (2.13) because in (2.13) the randomness is over the uniform sampling from the objective function 77 which is known (essentially we have a randomized optimization method), while in (2.14) the randomness is imposed on the instance space Ξ through a distribution D which is unknown to the learner in advance. In stochastic optimization all we need is not the exact gradient of objective function, but an unbiased estimate of the true gradient ∇LD (w). Surprisingly, it turns out that the construction of this unbiased estimate is extremely simple for risk minimization as follows. First, we sample an instance z = (xi , yi ) ∈ Ξ according to D and set the stochastic gradient to be g = ∂ℓ(w, (xi , yi ))xi , which will be an unbiased estimate of true gradient, i.e., E[g] = ∇LD (w). The beauty of SGD for direct risk minimization is that it is efficient and it delivers the same sample complexity as the ERM method. To motivate stochastic optimization as an alternative to the ERM method, [132, 131] challenged the ERM method and showed that there is a real gap between learnability and uniform convergence by investigating nontrivial problems where no uniform convergence holds, but they are still learnable using SGD algorithm [115]. These results uncovered an important relationship between learnability and stability, and showed that stability together with approximate empirical risk minimization, assures learnability [133]. Unlike ERM method in which the learnability is characterized by attendant complexity of hypothesis space, in SGD based learning, stability is a general notion to characterize learnability. In particular, in learning setting under i.i.d. samples where uniform convergence is not necessary for learnability, but where stability is both sufficient and necessary for learnability. 78 Chapter 3 Passive Learning with Target Risk The setup of this chapter will be in the classical statistical learning setting discussed in Chapter 2, but with a slight modification. In particular, we assume that the target expected loss, also referred to as target risk, is provided in advance for learner as prior knowledge. Unlike most studies in the learning theory that only incorporate the prior knowledge into the generalization bounds, we are able to explicitly utilize the target risk in the learning process. By leveraging on the smoothness of loss function, our analysis reveals a surprising result on the sample complexity of learning: by exploiting the target risk in the learning algorithm, we show that when the loss function is both smooth and strongly convex, the ( ) sample complexity reduces to O(log 1ϵ ), an exponential improvement compared to the sample complexity O( 1ϵ ) for learning with strongly convex loss functions. Furthermore, our proof is constructive and is based on a computationally efficient stochastic optimization algorithm, dubbed ClippedSGD, for such settings which demonstrate that the proposed algorithm is practically useful. The remainder of the chapter is organized as follows: Section 3.1 motivates the problem and setups the notation. Section 3.2 motivates the main intuition behind the proposed algorithm. The proposed ClippedSGD algorithm and main result on its sample complexity are discussed in Section 3.3. The proof of logarithmic sample complexity is given in Section 3.4 and the omitted proofs are deferred to Section 3.5. 
Section 3.6 summarizes the chapter and Section 3.7 surveys the related works. 79 3.1 Setup and Motivation Recall that in the standard statistical or passive supervised learning setting, we consider an input space Ξ ≡ X × Y where X ⊆ Rd is the space for instances and Y is the set of labels, and a hypothesis class H from which we choose a classifier. We assume that the domain space Ξ is endowed with an unknown probability measure D and measure the performance of a specific hypothesis h by defining a nonnegative loss function ℓ : H × Ξ → R+ . The risk of a hypothesis h with respect to the underlying distribution D is defined as: LD (h) = Ez∼D [ℓ(h, z)]. Given a sample S = (z1 , · · · , zn ) = ((x1 , y1 ), · · · , (xn , yn )) ∼ Ξn , the goal of a learning algorithm is to pick a hypothesis h : X → Y from H in such a way that its risk LD (h) is close to the minimum possible risk of a hypothesis in H. In the new setting we consider for learning here, we assume that before the start of the learning process, the learner has in mind a target expected loss, also referred to as target risk, denoted by ϵprior 1 , and tries to learn a classifier with the expected risk of O(ϵprior ) by labeling a small number of training examples. We further assume the target risk ϵprior is feasible, i.e., ϵprior ≥ ϵopt where ϵopt = minh∈H LD (h). To address this problem, we develop an efficient algorithm, based on stochastic optimization, for passive learning with target risk. The most surprising property of the proposed algorithm is that when the loss function is both smooth and strongly convex, it only needs O(d log(1/ϵprior )) labeled examples to find a classifier with the expected risk of O(ϵprior ), where d is the dimension of data. This is a 1 We use ϵ prior instead of ϵ to emphasize the fact that this parameter is known to the learner in advance. 80 significant improvement compared to the sample complexity for empirical risk minimization. We note that the target risk assumption is fully exploited by the learning algorithm and stands in contrast to all those assumptions such as the nature of unknown distribution D, sparsity, and margin that usually enter into the generalization bounds and are often perceived as a rather crude way to incorporate such assumptions. The key intuition behind the ClippedSGD algorithm is that by knowing target risk as prior knowledge, the learner has better control over the variance in stochastic gradients, which contributes mostly to the slow convergence in stochastic optimization and consequentially large sample complexity in passive learning. The trick is to run the stochastic optimization in multiple stages with a fixed size and decrease the variance of stochastically perturbed gradients at each iteration by a properly designed mechanism. Another crucial feature of the proposed algorithm is to utilize the target risk ϵprior to gradually refine the hypothesis space as the algorithm proceeds. Our algorithm differs significantly from standard stochastic optimization algorithms and is able to achieve a geometric convergence rate with the knowledge of target risk ϵprior . To analyze the sample complexity of ClippedSGD algorithm, we pursue the stochastic optimization viewpoint for risk minimization detailed in Chapter 2. Precisely, we focus on the convex learning problems for which we assume that the hypothesis class H is a parametrized convex set H = {hw : x → ⟨w, x⟩ : w ∈ Rd , ∥w∥ ≤ R} and for all z = (x, y) ∈ Ξ, the loss function ℓ(·, z) is a non-negative convex function. 
Thus, in the remainder we simply use vector w to represent hw , rather than working with hypothesis hw . We will assume throughout that X ⊆ Rd is the unit ball so that ∥x∥ ≤ 1. Finally, the conditions under which we can get the desired result on sample complexity depend on analytic properties of the loss function. In particular, we assume that the loss function is strongly convex and smooth as defined in 81 Chapter 2 and can be found in Appendix ??. We would like to emphasize that in our setting, we only need that the expected loss function LD (w) be strongly convex, without having to assume strong convexity for individual loss functions. 3.2 The Curse of Stochastic Oracle We begin by discussing stochastic optimization for risk minimization, convex learnability, and then the main intuition that motivates the proposed algorithm. As mentioned earlier in Chapter 2, most existing learning algorithms follow the framework of empirical risk minimizer (ERM) or regularized ERM methods that use the empirical loss over S, i.e., LS (w) = n1 ∑n i=1 ℓ(w, zi ), as a criterion to pick a hypothesis. In regu- larized ERM methods, the learner picks a hypothesis that jointly minimizes LS (w) and a regularization function over w. A rather different paradigm for risk minimization is stochastic optimization. Recall that the goal of learning is to approximately minimize the risk LD (w) = Ez∼D [ℓ(w, z)]. However, since the distribution D is unknown to the learner, we can not utilize standard gradient methods to minimize the expected loss. Stochastic optimization methods circumvent this problem by allowing the optimization method to take a step which is only in expectation [ ] along the negative of the gradient. To directly solve minw∈H LD (w) = Ez∼D [ℓ(w, z)] , a typical stochastic optimization algorithm initially picks some point in the feasible set H and iteratively updates these points based on first order perturbed gradient information about the function at those points. For instance, the widely used SGD algorithm starts with w0 = 0; at each iteration t, it queries the stochastic oracle Os at wt to obtain a perturbed 82 but unbiased gradient gt and updates the current solution by wt+1 = ΠH (wt − ηt gt ) , where ΠH (·) projects the solution w into the domain H. To capture the efficiency of optimization procedures in a general sense, one can use oracle complexity of the algorithm which, roughly speaking, is the minimum number of calls to any oracle needed by any method to achieve desired accuracy [121]. We note that the oracle complexity corresponds to the sample complexity of learning from the stochastic optimization viewpoint previously discussed. This viewpoint for learning theory has been taken by few very recent works [132, 131] where the ERM method has been challenged and it has been shown that there is a real gap between learnability and uniform convergence. This has been done by investigating non-trivial problems where no uniform convergence holds, but they are still learnable using SGD algorithm. These results uncovered an important relationship between learnability and stability, and showed that stability together with approximate empirical risk minimization, assures learnability [133]. Unlike ERM method in which the learnability is characterized by attendant complexity of hypothesis space, in SGD based learning, stability is a general notion to characterize learnability. In particular, in learning setting under i.i.d. 
samples where uniform convergence is not necessary for learnability, but where stability is both sufficient and necessary for learnability. To motivate the main intuition behind the proposed method, we begin by stating the following theorem which provides a lower bound on the sample complexity of stochastic optimization algorithms that is taken from [119]. Theorem 3.1 (Lower Bound on Oracle Complexity). Suppose LD (w) = Ez∼D [ℓ(w, z)] 83 is α-strongly and β-smooth convex function defined over convex domain H. Let Os be a stochastic oracle that for any point w ∈ H returns an unbiased estimate g, i.e., E[g] = [ ] ∇LD (w), such that E ∥g − ∇LD (w)∥2 ≤ σ 2 holds. Then for any stochastic optimization algorithm A to find a solution w with ϵ accuracy respect to the optimal solution w∗ , i.e., E [LD (w) − LD (w∗ )] ≤ ϵ, the number of calls to Os is lower bounded by (√ O(1) β log α ( β∥w0 − w∗ ∥2 ϵ ) σ2 + αϵ ) . (3.1) The first term in (3.1) comes from deterministic oracle complexity and the second term is due to noisy gradient information provided by stochastic oracle Os . As indicated in (3.1), the slow convergence rate for stochastic optimization is due to the variance in stochastic ( ) gradients, leading to at least O σ 2 /ϵ queries to be issued. We note that the idea of minibatch [44, 50], although it reduces the variance in stochastic gradients, does not reduce the oracle complexity. We close this section by informally presenting why logarithmic sample complexity is, in principle, possible, under the assumption that target risk is known to the learner A. To this end, consider the setting of Theorem 3.1 and assume that the learner A is given the prior accuracy ϵprior and is asked to find an ϵprior -accurate solution. If it happens that the variance [ ] of stochastic oracle Os has the same magnitude as ϵprior , i.e., E ∥g − ∇LD (w)∥2 ≤ ϵprior , then from (3.1) it follows that the second term vanishes and the learner A needs to issue ) ( only O log 1/ϵprior queries to find the solution. But, since there is no control on the stochastic oracle Os , except that the variance of stochastic gradients are bounded, A needs a mechanism to manage the variance of perturbed gradients at each iteration in order to 84 alleviate the influence of noisy gradients. One strategy is to replace the unbiased estimate of gradient with a biased one, which unfortunately may yield loose bounds. To overcome this problem, we introduce a strategy that shrinks the solution space with respect to the target risk ϵprior to control the damage caused by biased estimates. As an illustrative example to see how the knowledge of target risk is helpful, we consider a simple one dimensional regression problem with loss function ℓ(w, x) = (wx−b)2 where b is a random variable that can either be δ or 1 with Pr[b = δ] = 1 − δ 2 . Here we choose δ to be a very small value δ ≪ 1. The loss function is non-negative, smooth, and strongly convex and is appropriate for our setting. For this setting we have, ϵopt ≤ Eb [ℓ(0)] = δ 2 ×1+(1−δ 2 )×δ 2 ≤ 2δ 2 which can be arbitrarily small. For this example, the solution obtained by ERM with a small number of training examples will be on order of δ and therefore its expected risk will be on the order of δ 2 . However, from the viewpoint of the learner, this expected risk is unknown unless the learner could figure out Pr(b = 1) = δ 2 , which unfortunately requires an order of 1/δ 2 samples. 
On the other hand, by having the target feasible risk as prior knowledge the learner is able to find out Pr(b = 1) with a small number of samples. 3.3 The ClippedSGD Algorithm In this section we proceed to describe the proposed algorithm and state the main result on its sample complexity. 3.3.1 The Algorithm Description We now turn to describing our algorithm. Interestingly, our algorithm is quite dissimilar to the classic stochastic optimization methods. It proceeds by running the algorithm online on 85 fixed chunks of examples, and using the intermediate hypotheses and target risk ϵprior to gradually refine the hypothesis space. As mentioned above, we assume in our setting that the target expected risk ϵprior is provided to the learner a priori. We further assume the target risk ϵprior is feasible for the solution within the domain H, i.e., ϵprior ≥ ϵopt . The proposed algorithm explicitly takes advantage of the knowledge of expected risk ϵprior to ( ) attain an O log(1/ϵprior ) sample complexity. Throughout we shall consider linear predictors of form ⟨w, x⟩ and assume that the loss function of interest ℓ(⟨w, x⟩, y) is β-smooth. It is straightforward to see that LD (w) = E(x,y)∼D [ℓ(⟨w, x⟩, y)] is also β-smooth. In addition to the smoothness of the loss function, we also assume that LD (w) to be α-strongly convex. We denote by w∗ the optimal solution that minimizes LD (w), i.e., w∗ = arg minw∈H LD (w), and denote its optimal value by ϵopt . Let (xt , yt ), t = 1, . . . , T be a sequence of i.i.d. training examples. The proposed algorithm divides the T iterations into the m stages, where each stage consists of T1 training examples, i.e., T = mT1 . Let (xtk , ykt ) be the tth training example received at stage k, and let η be the step size used by all the stages. At the beginning of each stage k, we initialize the solution w by the average solution wk obtained from the last stage, i.e., T1 1 ∑ wk = wkt , T1 (3.2) t=1 where wkt denotes the tth solution at stage k. Another feature of the proposed algorithm is a domain shrinking strategy that adjusts the domain as the algorithm proceeds using intermediate hypotheses and target risk. We define the domain Hk used at stage k as Hk = {w ∈ H : ∥w − wk ∥ ≤ ∆k } , 86 (3.3) where ∆k is the domain size, whose value will be discussed later. Similar to the SGD method, at each iteration of stage k, we receive a training example (xtk , ykt ), and compute ( ) the gradient gkt = ℓ′ ⟨wkt , xtk ⟩, yt xtk . Instead of using the gradient directly, a clipped version ) ( of the gradient, denoted by vkt = clip γk , gkt , will be used for updating the solution. More specifically, the clipped vector vkt ∈ Rd is defined as ( [ ]) [ ] ) ([ ] ) ( [vkt ]i = clip γk , gkt , i = 1, . . . , d = sign gkt min γk , gkt i i i (3.4) where γk = 2ξβ∆k with ξ ≥ 1. Given the clipped gradient vkt , we follow the standard framework of stochastic gradient descent, and update the solution by wkt+1 = ΠH k ( ) wkt − ηvkt . (3.5) The purpose of introducing the clipped version of the gradient is to effectively control the variance in stochastic gradients, an important step toward achieving the geometric convergence rate. At the end of each stage, we will update the domain size by explicitly exploiting the target expected risk ϵprior as ∆k+1 = √ ε∆2k + τ ϵprior , (3.6) where ε ∈ (0, 1) and τ ∈ (0, 1) are two parameters, both of which will be discussed later. Algorithm 1 gives the detailed steps for the proposed method. 
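A minimal sketch of one stage of this update is given below; it combines the clipping operation (3.4), the projected step (3.5), the stage averaging (3.2), and the domain update (3.6). The per-example gradient is assumed to be supplied through a grad callable, and for simplicity the projection is only onto the ball of radius ∆_k around w̄_k, ignoring the intersection with the outer domain H; Algorithm 1 states the full procedure.

```python
import numpy as np

def clip(g, gamma):
    """Coordinate-wise clipping (3.4): [v]_i = sign([g]_i) * min(gamma, |[g]_i|)."""
    return np.sign(g) * np.minimum(gamma, np.abs(g))

def clipped_sgd_stage(w_bar, delta, sample, grad, eta, T1, xi, beta,
                      eps, tau, eps_prior):
    """One stage k of ClippedSGD.

    Returns the new average solution (3.2) and the shrunk domain size Delta_{k+1} (3.6).
    `sample` draws a training example (x, y); `grad` returns the per-example gradient.
    """
    gamma = 2.0 * xi * beta * delta              # clipping threshold gamma_k = 2*xi*beta*Delta_k
    w, avg = w_bar.copy(), np.zeros_like(w_bar)
    for _ in range(T1):
        x, y = sample()                          # receive a training example
        v = clip(grad(w, x, y), gamma)           # clipped stochastic gradient (3.4)
        w = w - eta * v                          # gradient step (3.5)
        offset = w - w_bar                       # project back onto the Delta_k-ball around w_bar_k
        norm = np.linalg.norm(offset)
        if norm > delta:
            w = w_bar + offset * (delta / norm)
        avg += w
    new_delta = np.sqrt(eps * delta ** 2 + tau * eps_prior)   # domain update (3.6)
    return avg / T1, new_delta                   # averaged stage solution (3.2)
```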
Algorithm 1 ClippedSGD Algorithm
1: Input: step size η, stage size T_1, number of stages m, target expected risk ϵ_prior, parameters ε ∈ (0, 1) and τ ∈ (0, 1) used for updating the domain size ∆_k, and parameter ξ ≥ 1 used to clip the gradients
2: Initialize: w̄_1 = 0, ∆_1 = R, and H_1 = H
3: for k = 1, . . . , m do
4:   Set w_k^1 = w̄_k and γ_k = 2ξβ∆_k
5:   for t = 1, . . . , T_1 do
6:     Receive training example (x_t, y_t)
7:     Compute the gradient g_k^t
8:     Clip the gradient g_k^t to v_k^t using [v_k^t]_i = sign([g_k^t]_i) min(γ_k, |[g_k^t]_i|), i = 1, . . . , d
9:     Update the solution by w_k^{t+1} = Π_{H_k}(w_k^t − ηv_k^t)
10:   end for
11:   Update the domain size ∆_{k+1} using (3.6)
12:   Compute the average solution w̄_{k+1} according to (3.2)
13:   Shrink the domain H_{k+1} using the expression in (3.3)
14: end for

The three important aspects of Algorithm 1, all crucial to achieving a geometric convergence rate, are highlighted as follows:

• Each stage of the proposed algorithm is comprised of the same number of training examples. This is in contrast to the epoch gradient algorithm [72], which divides the m iterations into exponentially increasing epochs and runs SGD with averaging on each epoch. Also, in our case the learning rate is fixed for all iterations.

• The proposed algorithm uses a clipped gradient for updating the solution in order to better control the variance in stochastic gradients; this stands in contrast to the SGD method, which uses the original gradients to update the solution.

• The proposed algorithm takes into account the target expected risk and the intermediate hypotheses when updating the domain size at each stage. The purpose of domain shrinking is to reduce the damage caused by the biased gradients that result from the clipping operation.

3.3.2 Main Result on Sample Complexity

The main theoretical result on the performance of the ClippedSGD algorithm is given in the following theorem.

Theorem 3.2 (Convergence Rate). Assume that the hypothesis space H is compact and the loss function ℓ is α-strongly convex and β-smooth. Let T = mT_1 be the size of the sample and let ϵ_prior be the target expected loss given to the learner in advance such that ϵ_opt ≤ ϵ_prior holds. Given ε ∈ (0, 1) and τ ∈ (0, 1), set ξ, T_1, and η as

  ξ = 4β/(ατ),  T_1 = 4 max{ ((ξ³βd + 2ξβ√d)/(εα)) ln(ms/δ), 16ξ²β²/(α²ε²) },  η = 1/(2ξβ√T_1),

where
  s = ⌈ log₂(ξβR²/ϵ_prior) ⌉.   (3.7)

After running Algorithm 1 over m stages, we have, with probability 1 − δ,

  L_D(w̄_{m+1}) ≤ (βR²/2) ε^m + (1 + τ/(1−ε)) ϵ_prior,

implying that only O(d log[1/ϵ_prior]) training examples are needed in order to achieve a risk of O(ϵ_prior).

We note that, compared to the bound in Theorem 3.1, the level of error down to which Algorithm 1 converges linearly is not determined by the noise level in the stochastic gradients but by the target risk. In other words, the algorithm is able to tolerate the noise by knowing the target risk as prior knowledge, and achieves linear convergence to the level of the target risk even when the variance of the stochastic gradients is much larger than the target risk. In addition, although the result given in Theorem 3.2 assumes a bounded domain with ∥w∥ ≤ R, this assumption can be lifted by effectively exploiting the strong convexity of the loss function and further assuming that the loss function is Lipschitz continuous with constant G, i.e., |L_D(w_1) − L_D(w_2)| ≤ G∥w_1 − w_2∥, ∀ w_1, w_2 ∈ H.
More specifically, the fact that the LD (w) is α-strongly convex with first order optimality condition, from Lemma A.14 for the optimal solution w∗ = arg minw∈H LD (w), we have LD (w) − LD (w∗ ) ≥ α ∥w − w∗ ∥2 , ∀w ∈ H. 2 This inequality combined with Lipschitz continuous assumption implies that for any w ∈ H the inequality ∥w − w∗ ∥ ≤ R∗ := 2G/α holds, and therefore we can simply set R = R∗ . We also note that this dependency can be resolved with a weaker assumption than Lipschitz continuity, which only depends on the gradient of loss function at origin. To this end, we define |ℓ′ (0, y)| = G. Using the fact that LD (w) is α-strongly, it is easy to verify that α ∥w ∥2 − G∥w ∥ ≤ 0, leading to ∥w ∥ ≤ R := 2 G and, therefore, we can simply set ∗ ∗ ∗ ∗ α 2 R = R∗ . We now use our analysis of Algorithm 1 to obtain a sample complexity analysis for learning smooth strongly convex problems with a bounded hypothesis class. To make it easier to parse, we only keep the dependency on the main parameters d, α, β, T , and ϵprior and hide the dependency on other constants in O(·) notation. Let w denote the output of Algorithm 1. By setting ε = 0.5 and letting c = O(τ ) to be an arbitrary small number, 90 Theorem 3.2 yields the following: Corollary 3.3 (Sample Complexity). Under the same conditions as Theorem 3.2, by running Algorithm 1 for minimizing LD (w) with a number of iterations (i.e., number of training examples) T , if it holds that, ( ( 4 T ≥ O dκ log 1 1 log log + log ϵprior ϵprior δ 1 )) where κ = β/α denotes the condition number of the loss function and d is the dimension of data, then with a probability 1 − δ, w attains a risk of O(ϵprior ), i.e., LD (w) ≤ (1 + c)ϵprior . As an example of a concrete problem that may be put into the setting of the present work is the regression problem with squared loss. It is easy to show that average square loss function is Lipschitz continuous with a Lipschitz constant β = λmax (X ⊤ X) which denotes the largest eigenvalue of matrix X ⊤ X where X is the data matrix. The strong convexity is guaranteed as long as the population data covariance matrix is not rank-deficient and its minimum eigenvalue is lower bounded by a constant α > 0. For this problem, the optimal minimax sample complexity is known to be O( 1ϵ ), but as it implies from Corollary 3.3, by the knowledge of target risk ϵprior , it is possible to reduce the sample complexity to O(log(1/ϵprior )). Remark 3.4. It is indeed remarkable that the sample complexity of Theorem 3.2 has κ4 = (β/α)4 dependency on the condition number of the loss function, which is worse than the √ β/α dependency in the lower bound in (3.1). Also, the explicit dependency of sample complexity on dimension d makes the proposed algorithm inappropriate for non-parametric settings. 91 3.4 Analysis of Sample Complexity Now we turn to proving the main theorem. The proof will be given in a series of lemmas and theorems where the proof of few are given in the Section 3.5. The proof makes use of the Bernstein inequality for martingales, idea of peeling process, self-bounding property of smooth loss functions, standard analysis of stochastic optimization, and novel ideas to derive the claimed sample complexity for the proposed algorithm. The proof of Theorem 3.2 is by induction and we start with the key step given in the following theorem. Theorem 3.5. Assume ϵprior ≥ ϵopt . 
For a fixed stage k, if ∥wk − w∗ ∥ ≤ ∆k , then, with a probability 1 − δ, we have ∥wk+1 − w∗ ∥2 ≤ a∆2k + bϵprior where a= [ √ ] s) √ 2 ( 2ξβ T1 + ξ 3 βd + 2ξβ d ln , αT1 δ b= 8 αξ (3.8) √ and s is given in (3.7), provided that ξ ≥ 16β/α and η = 1/(2ξβ T1 ) hold. Taking this statement as given for the moment, we proceed with the proof of Theorem 3.2, returning later to establish the claim stated in Theorem 3.5. Proof of Theorem 3.2. By setting a and b in (3.8) in Theorem 3.5 as a ≤ ε and b ≤ 2τ /β, we have ξ ≥ 4β/(ατ ) and T1 ≤ [ √ ] s) √ 2 ( 2ξβ T1 + ξ 3 βd + 2ξβ d ln αε δ 92 implying that { T1 ≥ 4 max } √ ξ 3 βd + 2ξβ d s 16ξ 2 β 2 . ln , 2 2 εα δ α ε Thus, using Theorem 3.5 and the definition of ξ and T1 , we have, with a probability 1 − δ, ∆2k+1 ≤ ε∆2k + 2τ ϵ . β prior After m stages, with a probability 1 − mδ, we have 2τ ∆2m+1 ≤ εm ∆21 + ϵprior β m−1 ∑ i=0 εi ≤ εm ∆21 + 2τ ϵ . β(1 − ε) prior By the β-smoothness of LD (w), it implies that LD (wm+1 ) − LD (w∗ ) ≤ β β m 2 τ ∥wm+1 − w∗ ∥2 ≤ ε ∆1 + ϵ , 2 2 1 − ε prior βR2 m τ ≤ ε + ϵ , 2 1 − ε prior where the last inequality follows from ∆1 ≤ R. The bound stated in the theorem follows the assumption that LD (w∗ ) = ϵopt ≤ ϵprior . We now turn to proving Theorem 3.5. To bound ∥wk+1 − w∗ ∥ in terms of ∆k , we start with the standard analysis of online learning. In particular, from the strong convexity 93 assumption of LD (w) and updating rule (3.5) we have, α LD (wkt ) − LD (w∗ ) ≤ ⟨∇LD (wkt ), wkt − w∗ ⟩ − ∥wkt − w∗ ∥2 2 α = ⟨vkt , wkt − w∗ ⟩ + ⟨∇LD (wkt ) − vkt , wkt − w∗ ⟩ − ∥wt − w∗ ∥2 2 t+1 t 2 2 ∥wk − w∗ ∥ − ∥wk − w∗ ∥ ηd ≤ + γk2 2η 2 α + ⟨∇LD (wkt ) − vkt , wkt − w∗ ⟩ − ∥wt − w∗ ∥2 , (3.9) 2 ≜v t k √ where the last step follows from ∥vkt ∥ ≤ γk d. By adding all the inequalities of (7.1) at stage k, we have T1 ∑ T LD (wkt ) − LD (w∗ ) t=1 T 1 1 ∑ ∥wk − w∗ ∥2 dη 2 α∑ t ≤ + γk T1 + vk − ∥wt − w∗ ∥2 2η 2 2 ≤ t=1 2 ∆k dη 2 α + γk T1 + Vk − Wk , 2η where Vk and Wk are defined as Vk = 2 t=1 (3.10) 2 ∑T 1 t t=1 vk and Wk = ∑T1 t 2 t=1 ∥wk − w∗ ∥ , respectively. In order to bound Vk , using the fact that ∇LD (wkt ) = Et [gkt ], we rewrite Vk as Vk = T1 ∑ ⟨−vkt + Et [vkt ], wkt − w∗ ⟩ + t=1 T1 ∑ t=1 ≜dt k [ ] ⟨Et gkt − Et [vkt ], wkt − w∗ ⟩ ≜et k = Dk + Ek , where Dk = ∑T 1 t t=1 dk and Ek = ∑T1 t t=1 ek which represent the variance and bias of the clipped gradient vkt , respectively. We now turn to separately upper bound each term. The following lemma bounds the variance term Dk using the Bernstein inequality for martingale. Its proof can be found in Section 3.5. 94 Lemma 3.6. For any L > 0 and µ > 0, we have ) ( ) ( ( √ ) s ϵprior T1 1 2 Pr Wk ≤ + Pr Dk ≤ Wk + Lγk d + γk ∆k d ln ≥1−δ 2µβ L δ where s is given by ⌈ ⌉ 8βµR2 s = log2 . ϵprior The following lemma bounds Ek using the self-bounding property of smooth functions and the proof is deferred to Section 3.5.2. Lemma 3.7. Ek ≤ 4T1 4β 4T 4β ϵopt + Wk ≤ 1 ϵprior + W . ξ ξ ξ ξ k Note that without the knowledge of ϵprior , we have to bound ϵopt by Ω(1), resulting in a very loose bound for the bias term Ek . It is knowledge of the target expected risk ϵprior that allows us to come up with a significantly more accurate bound for the bias term Ek , which consequentially leads to a geometric convergence rate. We now proceed to bound ∑T1 t t=1 LD (wk ) − LD (w∗ ) using the two bounds in Lemma 3.6 and 3.7. To this end, based on the result obtained in Lemma 3.6, we consider two scenarios. 
In the first scenario, we assume Wk ≤ ϵprior T1 2µβ (3.11) In this case, we have T1 ∑ LD (wkt ) − LD (w∗ ) ≤ t=1 95 ϵprior β Wk ≤ T . 2 2µ 1 (3.12) In the second scenario, we assume Dk ≤ ( √ ) s 1 WT + Lγk2 d + γk ∆k d ln . L δ (3.13) ξ In this case, by combining the bounds for Dk and Ek and setting L = 4β , we have ( ) √ 8β ξd 2 s 4T Vk ≤ Wk + γk + γk ∆k d ln + 1 ϵprior ξ 4β δ ξ ) ( √ 8β s 4T = Wk + ξ 3 βd + 2ξβ d ∆2k ln + 1 ϵprior , ξ δ ξ α where the last equality follows from the fact γk = 2ξβ∆k . If we choose ξ such that 8β ξ ≤ 2 or ξ ≥ 16β α > 1 holds, we get Vk ≤ ( √ ) α s 4T Wk + ξ 3 βd + 2ξβ d ∆2k ln + 1 ϵprior 2 δ ξ Substituting the above bound for Vk into the inequality of (3.10), we have T1 ∑ LD (wkt ) − LD (w∗ ) ≤ t=1 By choosing η as η = ( √ ) ∆2k η 2 s 4T + γk T1 + ξ 3 βd + 2ξβ d ∆2k ln + 1 ϵprior 2η 2 δ ξ ∆k 1 √ = √ , we have γk T 1 2ξβ T1 LD (wk+1 ) − LD (w∗ ) ≤ [ √ ] s) 2 4 √ 1 ( 2ξβ T1 + ξ 3 βd + 2ξβ d ln ∆k + ϵprior . T1 δ ξ (3.14) By combining the bounds in (3.12) and (3.14), under the assumption that at least one of the 96 two conditions in (3.11) and (3.13) is true, by setting µ = B/8, we have LD (wk+1 ) − LD (w∗ ) ≤ [ √ ] s) 2 4 √ 1 ( 2ξβ T1 + ξ 3 βd + 2ξβ d ln ∆k + ϵprior , T1 δ ξ implying ∥wk+1 − w∗ ∥ ≤ [ √ ] s) 2 √ 2 ( 8 2ξβ T1 + ξ 3 βd + 2ξβ d ln ∆k + ϵ . αT1 δ αξ prior We complete the proof by using Lemma 3.6, which states that the probability for either of the two conditions hold is no less than 1 − δ. 3.5 Proofs of Sample Complexity 3.5.1 Proof of Lemma 3.6 The proof is based on the Bernstein’s inequality for martingales which can be found in ⟨ ⟩ Lemma A.25. Define martingale difference dtk = wkt − w∗ , Et [vkt ] − vkt and martingale Dk = ∑T 1 t 2 t=1 dk . Let ΣT denote the conditional variance as Σ2T = T1 ∑ [ Et (dtk )2 ] ≤ t=1 ≤ T1 ∑ t=1 T ∑ [ Et ] 2 Et [vkt ] − vkt ∥wkt − w∗ ∥2 dγk2 ∥wkt − w∥2 = dγk2 Wk , t=1 which follows from the Cauchy’s Inequality and the definition of clipping. √ Define M = max |dtk | ≤ 2 dγk ∆k . To prove the inequality in Lemma 3.6, we follow the t 97 idea of peeling process [92]. Since Wk ≤ 4R2 T1 , we have ( ) √ √ Pr Dk ≥ 2γk Wk dρ + 2M ρ/3 ) ( √ √ 2 = Pr Dk ≥ 2γk Wk dρ + 2M ρ/3, Wk ≤ 4R T1 ) ( √ √ = Pr Dk ≥ 2γk Wk dρ + 2M ρ/3, Σ2T ≤ γk2 dWk , Wk ≤ 4R2 T1 ( ) √ √ ≤ Pr Dk ≥ 2γk Wk dρ + 2M ρ/3, Σ2T ≤ γk2 dWk , Wk ≤ ϵprior T1 /(2βµ) ( ) s i−1 T iT ∑ √ √ ϵ 2 ϵ 2 1 1 prior prior + Pr Dk ≥ 2γk Wk dρ + 2M ρ/3, Σ2T ≤ γk2 dWk , < Wk ≤ 2βµ 2βµ i=1 ) ( ϵprior T1 ≤ Pr Wk ≤ 2βµ   √ √ s 2 i 2 i+1 ∑ ϵprior 2 T1 γk d ϵprior 2 T1 γk d 2  + Pr Dk ≥ ρ+ M ρ, Σ2T ≤ 2βµ 3 2βµ i=1 ( ) ϵprior T1 ≤ Pr Wk ≤ + se−ρ , 2βµ where s is given by ⌉ ⌈ 8βµR2 . s = log2 ϵprior The last step follows the Bernstein inequality for martingales. We complete the proof by setting ρ = ln(s/δ) and using the fact that 2γk 3.5.2 √ 1 Wk ρd ≤ Wk + γk2 ρdL. L Proof of Lemma 3.7 To bound Ek , we need the following two lemmas. The first lemma bounds the deviation of the expected value of a clipped random variable from the original variable, in terms of its variance (Lemma A.2 from [74]). 98 Lemma 3.8. Let X be a random variable, let X = clip(X, C) and assume that |E[X]| ≤ C/2 for some C > 0. Then |E[X] − E[X]| ≤ 2 |Var[X]| C Another key observation used for bounding Ek is the fact that for any non-negative βsmooth convex function, we have the following self-bounding property. We note that this self-bounding property has been used in [138] to get better (optimistic) rates of convergence for non-negative smooth losses. Lemma 3.9. 
For any β-smooth non-negative function f : R → R, we have |f ′ (w)| ≤ √ 4βf (w) Proof. See Appendix ?? Proof of Lemma 3.7. To apply the above lemmas, we write etk as etk = d ∑ )] [ ( Et ℓ′ (⟨wkt , xtk ⟩, yt )[xtk ]i − clip γk , ℓ′ (⟨wkt , xtk ⟩, yt )[xtk ]i [wkt − w∗ ]i i=1 In order to apply Lemma 3.8, we check if the following condition holds ] [ ( ) γk ≥ 2 Et ℓ′ ⟨wkt , xtk ⟩, yt [xtk ]i (3.15) Since ≤ ) ] [ ( Et ℓ′ ⟨wkt , xtk ⟩, yt [xtk ]i [ ( ) ] ( )} ] ) [{ ( Et ℓ′ ⟨wkt , xtk ⟩, yt − ℓ′ ⟨w∗ , xtk ⟩, yt [xtk ]i + Et ℓ′ ⟨w∗ , xtk ⟩, yt [xtk ]i ≤ β∥wkt − w∗ ∥ ≤ β∆k 99 [ ( ) ] where the last inequality follows from Et ℓ′ ⟨w∗ , xtk ⟩, yt [xtk ]i = 0 since w∗ is the minimizer of LD (w), we thus have ) ] [ ( γk = 2ξβ∆k ≥ 2β∆k ≥ 2 Et ℓ′ ⟨wkt , xtk ⟩, yt [xtk ]i where ξ ≥ 1, implying that the condition in (3.15) holds. Thus, using Lemma 3.8, we have etk ≤ ≤ [ ( )2 1 [wkt − w∗ ]i Et ℓ′ (⟨wkt , xtk ⟩, yt )[xtk ]i γk i=1 [( )2 ] 2∥wkt − w∗ ∥∞ ′ t t Et ℓ (⟨wk , xk ⟩, yt ) γk d ∑ ] Using Lemma 3.9 to upper bound the right hand side, we further simplify the above bound for etk as )] 8β∥wkt − w∗ ∥∞ [ ( t t Et ℓ ⟨wk , xk ⟩, yt γk t 8β∥wk − w∗ ∥∞ LD (wkt ) = γk 8β∆k ≤ LD (wkt ) γk 4 L (wt ) = ξ D k etk ≤ where the second inequality follows from ∥wkt − w∗ ∥∞ ≤ ∥wkt − w∗ ∥ ≤ ∆k . Therefore we 100 obtain Ek = T1 ∑ t=1 4 etk ≤ ξ T1 ∑ LD (wkt ) t=1 T T t=1 t=1 T1 1 1 4∑ 4∑ = LD (w∗ ) + LD (wkt ) − LD (w∗ ) ξ ξ ≤ 4T1 4β ∑ t ∥wk − w∗ ∥2 LD (w∗ ) + ξ ξ t=1 4T1 4β = LD (w∗ ) + W , ξ ξ k where the second inequality follows from the smoothness assumption of LD (w). 3.6 Summary In this chapter, we have studied the sample complexity of passive learning when the target expected risk is given to the learner as prior knowledge. The crucial fact about target risk assumption is that, it can be fully exploited by the learning algorithm and stands in contrast to most common types of prior knowledges that usually enter into the generalization bounds and are often perceived as a rather crude way to incorporate such assumptions. We showed that by explicitly employing the target risk ϵprior in a properly designed stochastic optimization algorithm, it is possible to attain the given target risk ϵprior with a logarithmic ( ) 1 sample complexity log ϵ , under the assumption that the loss function is both strongly prior convex and smooth. There are various directions for future research. The current study is restricted to the parametric setting where the hypothesis space is of finite dimension. It would be interesting to see how to achieve a logarithmic sample complexity in a non-parametric setting where hypotheses lie in a functional space of infinite dimension. Evidently, it is impossible to extend 101 the current algorithm for the non-parametric setting; therefore additional analysis tools are needed to address the challenge of infinite dimension arising from the non-parametric setting. It is also an interesting problem to relate target risk assumption we made here to the low noise margin condition which is often made in active learning for binary classification since both settings appear to share the same sample complexity. However it is currently unclear how to derive a connection between these two settings. We believe this issue is worthy of further exploration and leave it as an open problem. 
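To make the stage-wise scheme analyzed in this chapter concrete, the following Python sketch simulates a multi-stage stochastic gradient method with per-coordinate gradient clipping, in the spirit of Algorithm 1. The problem instance (least squares with a well-conditioned design), the number of stages, the halving of ∆k after each stage, and the use of the within-stage average to start the next stage are illustrative assumptions rather than the exact prescriptions of Theorem 3.2.

import numpy as np

rng = np.random.default_rng(0)

# Synthetic strongly convex and smooth problem: least squares with a well-conditioned design.
d, n = 5, 10000
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

beta = np.linalg.eigvalsh(X.T @ X / n).max()   # smoothness (largest eigenvalue of the covariance)
alpha = np.linalg.eigvalsh(X.T @ X / n).min()  # strong convexity (smallest eigenvalue)

def clipped_sgd(num_stages=6, T1=2000, xi=None, Delta1=4.0):
    """Multi-stage SGD with per-coordinate gradient clipping (illustrative sketch)."""
    xi = xi if xi is not None else 16 * beta / alpha   # xi >= 16 beta / alpha, as in Theorem 3.5
    w = np.zeros(d)
    Delta = Delta1
    for k in range(num_stages):
        gamma_k = 2 * xi * beta * Delta            # clipping threshold for stage k
        eta = 1.0 / (2 * xi * beta * np.sqrt(T1))  # fixed step size within the stage
        w_sum = np.zeros(d)
        for t in range(T1):
            i = rng.integers(n)
            g = (X[i] @ w - y[i]) * X[i]           # stochastic gradient of the squared loss
            v = np.clip(g, -gamma_k, gamma_k)      # per-coordinate clipping
            w = w - eta * v
            w_sum += w
        w = w_sum / T1                             # stage average starts the next stage
        Delta /= 2                                 # shrink the assumed distance to w_*
        risk = 0.5 * np.mean((X @ w - y) ** 2)
        print(f"stage {k + 1}: risk = {risk:.4f}")
    return w

clipped_sgd()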
3.7 Bibliographic Notes Sample complexity of passive learning is well established and goes back to early works in ( ) ( ) the learning theory where the lower bounds Ω 1ϵ (log 1ϵ + log 1δ ) and Ω 12 (log 1ϵ + log 1δ ) ϵ were obtained in classic PAC and general agnostic PAC settings, respectively [52, 26, 7]. There has been an upsurge of interest over the last decade in finding tight upper bounds on the sample complexity by utilizing prior knowledge on the analytical properties of the loss function, that led to stronger generalization bounds in agnostic PAC setting. In [95] fast rates obtained for squared loss, exploiting the strong convexity of this loss function, which only holds under pseudo-dimensionality assumption. With the recent development in online strongly convex optimization [68], fast rates approaching O( 1ϵ log 1δ ) for convex Lipschitz strongly convex loss functions has been obtained in [140, 82]. For smooth non-negative loss functions, [138] improved the sample complexity to optimistic rates ( ( )( )) 1 ϵopt + ϵ 1 1 3 O + log log ϵ ϵ ϵ δ 102 for non-parametric learning using the notion of local Rademacher complexity [13], where ϵopt is the optimal risk. The proposed ClippedSGD algorithm is related to the recent studies that examined the learnability from the viewpoint of stochastic convex optimization. In [139, 133], the authors presented learning problems that are learnable by stochastic convex optimization but not by empirical risk minimization (ERM). Our work follows this line of research. The proposed algorithm achieves the sample complexity of O(d log(1/ϵprior )) by explicitly incorporating the target expected risk ϵprior into the stochastic convex optimization algorithm. It is however difficult to incorporate such knowledge into the framework of ERM. Furthermore, it is worth noting that in [127, 139, 126, 20], the authors explored the connection between online optimization and statistical learning in the opposite direction. This was done by exploring the complexity measures developed in statistical learning for the learnability of online learning. We note that our work does not contradict the lower bound in [138] because a feasible target risk ϵprior is given in our learning setup and is fully exploited by the proposed algorithm. Knowing that the target risk ϵprior is feasible makes it possible to improve the sample complexity from O(1/ϵprior ) to O(log(1/ϵprior )). We also note that although the logarithmic sample complexity is known for active learning [66, 12], we are unaware of any existing passive learning algorithm that is able to achieve a logarithmic sample complexity by incorporating any kind of prior knowledge. The proposed algorithm is also closely related to the recent works that stated O(1/n) is the optimal convergence rate for stochastic optimization when the objective function is strongly convex [76, 72, 125]. In contrast, the proposed algorithm is able to achieve a geometric convergence rate for a target optimization error. Similar to the previous argument, our result does not contradict the lower bound given in [72] because of the knowledge of a 103 feasible optimization error. Moreover, in contrast to the multistage algorithm in [72] where the size of stages increases exponentially, in our algorithm, the size of each stage is fixed to be a constant. 
104 Chapter 4 Statistical Consistency of Smoothed Hinge Loss In Chapter 2 we discussed that convex surrogates of the 0-1 loss are highly preferred because of the computational and theoretical virtues that convexity brings in and most prominent practical methods studied in machine learning make significant use of convexity. This is of more importance if we consider smooth surrogates as witnessed by the fact that the smoothness is further beneficial both computationally- by attaining an optimal convergence rate for optimization, and in a statistical sense- by providing an improved optimistic rate for generalization bound. This chapter concerns itself with the statistical consistency of smooth convex surrogates. The statistical consistency finds general quantitative relationships between the excess risk errors associated with convex and those associated with 0-1 loss. Consistency results provide reassurance that optimizing a surrogate does not ultimately hinder the search for a function that achieves the binary excess risk, and thus allow such a search to proceed within the scope of computationally efficient algorithms. Statistical consistency of surrogates under conditions such as convexity is a well studied problem in learning community and quantitative relationships between binary risk and convex excess risk has been established. In this chapter we investigate the smoothness property from the viewpoint of statistical consistency and 105 show how it affects the binary excess risk. We show that in contrast to optimization and generalization errors that favor the choice of smooth surrogate loss, the smoothness of loss function may degrade the binary excess risk. Motivated by this negative result, we provide a unified analysis that integrates optimization error, generalization bound, and the error in translating convex excess risk into a binary excess risk when examining the impact of smoothness on the binary excess risk. We show that under favorable conditions appropriate √ choice of smooth convex loss will result in a binary excess risk that is better than O(1/ n). The reminder of this paper is organized as follows. In Section 4.1 we set up notation and describe the setting. Section 4.2 briefly discusses the classification-calibrated convex surrogate losses on which our analysis relies. We derive the ψ-transform for smoothed hinge loss and elaborate its binary excess risk in Section 4.3. Section 4.4 provides a unified analysis of three types of errors and derives conditions in terms of smoothness to obtain better rates for the binary excess risk. The omitted proofs are included in Section 4.5. Section 4.6 concludes the paper. 4.1 Motivation Let S = ((x1 , y1 ), (x2 , y2 ), · · · , (xn , yn )) be a set of i.i.d. samples drawn from an unknown distribution D over Ξ = X × {−1, +1}, where xi ∈ X ⊆ Rd is an instance and yi ∈ {−1, +1} is the binary class assignment for xi . Let κ(·, ·) be an universal kernel and let Hκ be the Reproducing Kernel Hilbert Space (RKHS) endowed with kernel κ(·, ·). According to [157], Hκ is a rich function space whose closure includes all the smooth functions. We consider predictors from Hκ with bounded norm to form the measurable function class H = {h ∈ Hκ : ∥h∥Hκ ≤ B}. 106 For a function h : X → R, the risk of h is defined as: LD (h) = E(x,y)∼D [I[yh(x) ≤ 0]] = P [yh(x) ≤ 0] . Let h∗ be the optimal classifier that attains the minimum risk, i.e. h∗ = arg min P [yh(x) ≤ 0] . h We assume h∗ ∈ Hκ with ∥h∗ ∥Hκ ≤ B. 
This boundedness condition is satisfied for any RKHS with a bounded kernel (i.e. supx∈X κ(x, x) ≤ B). Henceforth, let L∗D stand for the minimum achievable risk by the optimal classifier h∗ , i.e., L∗D = LD (h∗ ). Define the binary excess risk for a prediction function h ∈ H as E(h) = LD (h) − L∗D . Our goal is to efficiently learn a prediction function h ∈ H from the training examples in S that minimizes the binary excess risk E(h). Many studies of binary excess risk assume that the optimal classifier h ∈ H is learned by minimizing the empirical binary risk, ∑ minh∈H n1 n i=1 I[yi h(xi ) ≤ 0], an approach that is usually referred to as Empirical Risk Minimization (ERM) [145]. To understand the generalization performance of the classifier learned by ERM, it is important to have upper bounds on the excess risk of the empirical minimizer that hold with a high probability and that take into account complexity measures of classification functions. It is well known that, under certain conditions, direct empirical classification error minimization is consistent [145] and achieves a fast convergence rate under low noise situations [109]. 107 One shortcoming of the ERM based approaches is that they need to minimize 0-1 loss, leading to non-convex optimization problems that are potentially NP-hard 1 [8, 75]. A common practice to circumvent this difficulty is to replace the indicator function I[· ≤ 0] with some convex loss ϕ(·) and find the optimal solution by minimizing the convex surrogate loss. Examples of such surrogate loss functions for 0-1 loss include logit loss ϕlog (h; (x, y)) = log(1+exp(−yh(x))) in logistic regression [61], hinge loss ϕHinge (h; (x, y)) = max(0, 1−yh(x)) in support vector machine (SVM) [43] and exponential loss ϕexp (h; (x, y)) = exp(−yh(x)) in AdaBoost [58]. Given a convex surrogate loss function ϕ : R → R+ (e.g., hinge loss, exponential loss, or logistic loss) we define the risk with respect to the convex loss ϕ (convex risk or ϕ-risk) as ϕ LD (h) = E(x,y)∼D [ϕ(yh(x))]. ϕ,∗ Similarly we define the optimal ϕ-risk as LD = inf h∈H E(x,y)∼D [ϕ(yh(x))]. The excess ϕ-risk or convex excess risk of a classifier h ∈ H with respect to the convex surrogate loss ϕ(·) is defined as ϕ ϕ,∗ Eϕ (h) = LD (h) − LD . An important line of research in statistical learning theory focused on relating the convex excess risk Eϕ (h) to the binary excess risk E(h) that will be elaborated in next section. It is known that under mild conditions, the classifier learned by minimizing the empirical loss of convex surrogate is consistent to the Bayes classifier [156, 101, 80, 97, 141, 14]. For 1 We note that several works [84, 85] provide efficient algorithms for direct 0-1 empirical error minimization but under strong (unrealistic) assumptions on data distribution or label generation. 108 instance, it was shown in [14] that the necessary and sufficient condition for a convex loss ϕ(·) to be consistent with the binary loss is that ϕ(·) is differentiable at origin and ϕ′ (0) < 0. It was further established in the same work that the binary excessive risk can be upper bound by the convex excess risk through a ψ-transform that depends on the surrogate convex loss ϕ(·). Since the choice of convex surrogates could significantly affect the binary excess risk, in this chapter, we will investigate the impact of the smoothness of a convex loss function on the binary excess risk. 
This is motivated by the recent results that show the advantages of using smooth convex surrogates in reducing the optimization complexity and the generalization error bound. More specifically, [121, 143] show that a faster convergence rate (i.e., O(1/T 2 )) can be achieved by first order methods when the objective function to be optimized is convex and smooth such as accelerated gradient descent method introduced in Chapter 2; in [138], the authors show that a smooth convex loss will lead to a better optimistic generalization error bound rooted in the self-bounding property of smooth losses (Lemma A.12). Given the positive news of using smooth convex surrogates, an open research question is how the smoothness of a convex surrogate will affect the binary excess risk. The answer to this question, as will be revealed later, is negative: the smoother the convex loss, the poorer approximation will be for the binary excess risk. Thus, the second contribution of this work is to integrate these results for smooth convex losses, and examine the overall effect of replacing 0-1 loss with a smooth convex loss when taking into account three sources of errors, i.e. the optimization error, the generalization error, and the error in translating the convex excess risk into the binary risk. As we will show, under favorable conditions, appropriate √ choice of smooth convex loss will result a binary excess risk better than O(1/ n). 109 4.2 Classification Calibration and Surrogate Risk Bounds Although it is computationally convenient to minimize the empirical risk based on a convex surrogate, the ultimate goal of any classification method is to find a function h ∈ Hκ that minimizes the binary loss. Therefore, it is crucial to investigate the conditions which ϕ,∗ guarantee that if the ϕ-risk of h gets close to the optimal LD , the binary risk of h will also approach the optimal binary risk L∗D . This question has been an active trend in statistical learning theory over the last decade where the necessary and sufficient conditions have been established for relating the binary excess risk to a convex excess risk [156, 101, 80, 97, 141, 14]. In this chapter we follow the strategy introduced in [14] in order to relate the binary excess risk to the excess ϕ-risk. Their methodology, through the notion of classification calibration, allows us to find quantitative relationship between the excess risk associated with ϕ and the excess risk associated with 0-1 loss. It is established in [14] that the binary excessive risk can be bounded by the convex excess risk, based on the convex loss function ϕ, through a ψ-transform. Definition 4.1. Given a loss function ϕ : R → [0, ∞), define the function ψ : [0, 1] → [0, ∞) by ˜ ψ(z) = H− ( 1+z 2 ) ( −H 1+z 2 ) where H − (η) = inf α:α(2η−1)≤0 (ηϕ(α) + (1 − η)ϕ(−α)) and H(η) = inf (ηϕ(α) + (1 − η)ϕ(−α)) . α∈R ˜ The transform function ψ : [0, 1] → [0, ∞) is defined to be the convex closure of ψ. 110 The following theorem from [14, Theorem 1] shows that the binary excess risk can be bounded by the convex excess risk using transform function ψ : [0, 1] → [0, ∞) that depends on the surrogate convex loss function. Theorem 4.2. For any non-negative loss function ϕ(·), any measurable function h ∈ H, and any probability distribution D on X × Y, there is a nondecreasing function ψ : [0, 1] → [0, ∞) that ψ(LD (h) − L∗D ) ≤ LD (h) − LD ϕ ϕ,∗ (4.1) holds. Here the minimization is taken over all measurable functions. Definition 4.3. 
A convex loss ϕ is classification-calibrated if, for any η ̸= 1/2, H − (η) > H(η). This condition is essentially an extension of [156, Theorem 2.1] and can be viewed as a form of Fisher consistency that is appropriate for classification. It has been shown in [14] that the necessary and sufficient condition for a convex loss ϕ(z) to be classification-calibrated is if it is differentiable at the origin and ϕ′ (0) < 0. In particular, for a certain convex function ϕ(·), the ψ-transform can be computed by ( ψ(z) = inf αz≤0 ) ( ) 1−z 1+z 1−z 1+z ϕ(α) + ϕ(−α) − inf ϕ(α) + ϕ(−α) , 2 2 2 2 α∈R ( ) that can be further simplified as ψ(z) = ϕ(0) − H 1+z when ϕ is classification-calibrated. 2 Examples of ψ-transform for the convex surrogate functions of known practical algorithms 111 mentioned before are as follows: (i) for hinge loss ϕ(α) = max(0, 1 − α) , ψ(z) = |z|, (ii) for √ exponential loss ϕ(α) = e−α , ψ(z) = 1 − 1 − z 2 ≥ z 2 /2, and (iii) for truncated quadratic loss ϕ(α) = [max(0, 1 − α)]2 , ϕ(z) = z 2 . Remark 4.4. We note that the inequality in (4.1) provides insufficient guidance on choosing appropriate loss function. A few brief comments are appropriate. First, it does not measure ϕ ϕ,∗ explicitly how the choice of the convex surrogate ϕ(·) affects the excess risk LD (h) − LD . Second, it does not take into account the impact of loss function on optimization efficiency, an important issue for practitioners when dealing with big data. It is thus unclear, from Theorem 4.2, how to choose an appropriate loss function that could result in a small generalization error for the binary loss when the computational time is limited. In this chapter, we address these limitations by examining a family of convex losses that are constructed by smoothing the hinge loss function using different smoothing parameters. We study the binary excessive risk of the learned classification function by taking into account errors in optimization, generalization, and translation of convex excess risk into binary excess risk. 4.3 Binary Excess Risk for Smoothed Hinge Loss As stated before, to efficiently learn a prediction function h ∈ H, we will replace the binary loss with a smooth convex loss. Since hinge loss is one of the most popular loss functions used in machine learning and is the loss of choice for classification problems in terms of the margin error [19], in this work, we will focus on the smoothed version of the hinge loss. Another advantage of using the hinge loss is that its ψ-transform is a linear function. Compared with the ψ-transforms of other popular convex loss functions (e.g. exponential loss and truncated square loss) that are mostly quadratic, using the hinge loss as convex surrogate will lead to 112 a tighter bound for the binary excess risk. The smoothed hinge loss considered in this chapter is defined as 1 ϕ(z; γ) = max α(1 − z) + R(α), γ α∈[0,1] (4.2) where R(α) = −α log α − (1 − α) log(1 − α) and γ > 0 is the smoothing parameter. It is straightforward to verify that the loss function in (4.2) can be simplified as ϕ(z; γ) = 1 log(1 + exp(γ(1 − z))). γ It is not immediately clear from Theorem 4.2 how the relationship between smooth convex excess risk Eϕ (·) and binary excess risk is affected by the smoothness parameter γ. 
In addition, as discussed in [14], whereas conditions such as convexity and smoothness have natural relationship to optimization and generalization, it is not immediately obvious how properties such as convexity and smoothness of convex surrogate relates to statistical consequences. In what follows, we show that, indeed smoothness of loss function has a negative statistical consequence and can degrade the binary excess risk. 4.3.1 ψ-Transform for Smoothed Hinge Loss The first step in our analysis is to derive the ψ-transform for the loss function defined in (4.2) as stated in the following theorem. Theorem 4.5. The ψ-transform of smoothed hinge loss with smoothing parameter γ is given 113 by 1+η ψ(η; γ) = − log 2γ ( [ ]) ( [ ]) 1 1−η 1 C1 C2 γ γ 1+e − log 1+e 1 + eγ 1+η 2γ 1 + eγ 1−η where C1 and C2 are defined as C1 = −ηeγ + √ √ η 2 e2γ + 1 − η 2 and C2 = ηeγ + η 2 e2γ + 1 − η 2 . The ψ-transform given in Theorem 4.5 is too complicated to be useful. The theorem below provides a simpler bound for the ψ-transform in terms of the smoothness parameter γ. Theorem 4.6. For η ∈ (−1, 1), we have ψ(η; γ) ≥ |η| − 1 1 log . γ |η| Remark 4.7. The bound obtained in Theorem 4.6 demonstrates that when γ approaches to infinity, the ψ-transform for smoothed hinge loss ϕ(η; γ) becomes |η|. According to [14], the ψ-transform for the hinge loss is ψ(η) = |η|. Therefore, this result is consistent with the ψ-transform for smoothed hinge loss, which is the limit of ϕ(z; γ) as γ approaches infinity. 4.3.2 Bounding E(h) based on Eϕ (h) Based on the transform function ψ(·; γ) that is computed for smoothed hinge loss with smoothing parameter γ, we are now in the position to bound its corresponding binary excess risk E(h). Our main result in this section is the following theorem that shows how binary excess risk can be bounded by the excess ϕ-risk for smoothed hinge loss. Theorem 4.8. Consider any measurable function h ∈ H and the smoothed hinge loss ϕ(·) with parameter γ defined in (4.2). Then, binary excess risk E(h) can be bounded by the 114 smooth convex excess risk Eϕ (h) as E(h) ≤ Eϕ (h) + Eϕ (h) 1 log . 1 + γEϕ (h) Eϕ (h) Proof. Using the result from Theorem 4.2, we have Eϕ (h) ≥ ψ(E(h); γ) and therefore an immediate result from the ψ-transform for smoothed hinge loss that is obtained in Theorem 4.6 indicates E(h) + 1 log E(h) ≤ Eϕ (h). γ Define ∆ = E(h) − Eϕ (h). We have ( ) 1 1 ∆ 1 ≤ 0. ∆ + log(∆ + Eϕ (h)) = ∆ + log Eϕ (h) + log 1 + γ γ γ Eϕ (h) Based on the log(1 + x) ≤ x inequality, the sufficient condition for the above inequality to hold is to have ∆+ ∆ 1 1 ≤ log γEϕ (h) γ Eϕ (h) and therefore ∆≤ Eϕ (h) γ −1 1 1 log = log . Eϕ (h) 1 + γEϕ (h) Eϕ (h) 1 + (γEϕ (h))−1 The final bound is obtained by substituting E(h) − Eϕ (h) for ∆ in the left hand side of above inequality. As indicated by Theorem 4.8, the smaller the smoothing parameter γ, the poorer the approximation is in bounding the binary excess E(h) with smooth convex excess risk Eϕ (h). On the other hand, the smoothness of loss function has been proven to be beneficial in terms 115 of optimization error and generalization bound. The mixture of negative and positive results for using smooth convex surrogates motivates us to develop an integrated bound for binary excess risk that takes into account all types of errors. One of the main contributions of this work is to show that under favorable conditions, with appropriate choice of smoothing parameter, the smoothed hinge loss will result in a bound for the binary excess risk better √ than O(1/ n). 
4.4 A Unified Analysis Using the smoothed hinge loss, we define the convex loss for a prediction function h ∈ H as LD (h) = E[ϕ(yh(x); γ)]. Let h∗γ be the optimal classifier that minimizes LD (h). Similar to ϕ ϕ the case of binary loss, we assume h∗γ ∈ Hκ with ∥h∗γ ∥ ≤ B. The smooth convex excess risk for a given prediction function h ∈ H is then given by Eϕ (h) = LD (h) − LD (h∗γ ). Given the ϕ ϕ smooth convex loss ϕ(z; γ) in (4.2), we find the optimal classifier by minimizing the empirical ϕ ϕ convex loss, i.e. minh∈Hκ ,∥h∥ ≤B LS (h), where the empirical convex loss LS (h) is given Hκ by 1∑ ϕ(yi h(xi ); γ). n n ϕ LS (h) = (4.3) i=1 Let h be the solution learned from solving the empirical convex loss over training examples. There are three sources of errors that affect bounding the binary excess risk E(h). First, since h is obtained by numerically solving an optimization problem, the error in estimating the optimal solution, which we refer to as optimization error 2 , will affect E(h). Additionally, 2 We note that in literature the error in estimating the optimal solution for empirical minimization is usually referred to as estimation error. We emphasize it as optimization error because different convex surrogates could lead to very different iteration complexities and consequentially different optimization efficiency. 116 since the binary excess risk can be bounded by a nonlinear transform of the convex excess risk, both the bound for Eϕ (h) and the error in approximating E(h) with Eϕ (h) will affect the final estimation of E(h). We aim at investigating how the smoothing parameter γ affect all these three types of errors. As it is investigated in Theorem 4.8, a smaller smoothing parameter γ will result in a poorer approximation of E(h). On the other hand, a smaller smoothing parameter γ will result in a smaller estimation error and a smaller bound for Eϕ (h). Based on the understanding of how smoothing parameter γ affects the three errors, we identify the choice of γ that results in the best tradeoff between all three error and √ consequentially a binary excess risk E(h) better than O(1/ n). To investigate how the smoothing parameter γ affects the binary excess risk E(h), we intend to unify three types of errors. The analysis is comprised of two components, i.e. bounding the binary excess risk E(h) by a smooth convex excess risk Eϕ (h) that has been established in Theorem 4.8 and bounding Eϕ (h) for a solution h that is suboptimal in miniϕ mizing the empirical convex loss LS (h) that is the focus of this section. 4.4.1 Bounding Smooth Excess Convex Risk Eϕ (h) We now turn to bounding the excess ϕ-risk Eϕ (h) for the smoothed hinge loss. To bound Eϕ (h) we need to consider two types of errors: optimization error due to the approximate optimization of the empirical ϕ-risk, and the generalization error bound for the empirical risk minimizer. After obtaining these two errors for smooth convex surrogates, we provide a unified bound on the excess ϕ-risk Eϕ (h) of empirical convex risk minimizer in terms of n. We begin by bounding the error arising from solving the optimization problem numerically. One nice property of smoothed hinge loss function is that both its first order and 117 second order derivatives are bounded, i.e. |ϕ′ (z; γ)| = exp(γ(1 − z) ≤ 1, 1 + exp(γ(1 − z)) ϕ′′ (z; γ) = γ exp(γ(1 − z)) γ ≤ . 
2 4 (1 + exp(γ(1 − z))) Due to the smoothness of ϕ(z; γ), we can apply the accelerated optimization algorithm [121, 143] to achieve an O(1/k 2 ) convergence rate for the optimization, where k is the number of iterations the optimization algorithm proceeds (see e.g., accelerated gradient descent algorithm in Subsection 2.3.2.2). More specifically, we will apply Algorithm 1 from [143] to solve the numerical optimization problem in (4.3) over the convex domain H = {h ∈ Hκ : ∥h∥Hκ ≤ B} which results in the following updating rules at sth iteration: gs = (1 − θs )hs + θs fs ) ( θs ϕ fs+1 = arg min ⟨∇LS (gs ), f − gs ⟩ + ∥f − fs ∥Hκ 2 f ∈H (4.4) hs+1 = (1 − θs )hs + θs fs+1 . The following theorem that follows immediately from [143, Corollary 1] and the fact ϕ′′ (z; γ) ≤ γ/4, bounds the optimization error for the optimization problem after k iterations. Lemma 4.9. Let h = hk+1 be the solution obtained by running accelerated gradient descent method (i.e., updating rules in (4.4)) to solve the optimization problem in (4.3) after k iterations with θ0 = 1 and θk = 2/(k + 2) for k ≥ 1. We have ϕ LS (h) ≤ min ϕ ∥h∥H ≤B κ LS (h) + γB 2 . (k + 2)2 We now turn to understanding the generalization error for the smooth convex loss. There are many theoretical results giving upper bounds of the generalization error. However, a 118 recent result [138] has showed that it is possible to obtain optimistic rates for generalization bound of smooth convex loss (in the sense that smooth losses yield better generalization bounds when the problem is easier), which are more appealing than the generalization of simple Lipschitz continuous losses. The following theorem from [138, Theorem 1] bounds the generalization error for any solution h ∈ H when the learning has been performed by a smooth convex surrogate ϕ(·). Lemma 4.10. With a probability 1 − δ, for any ∥h∥Hκ ≤ B, we have ( ) 2 )t (B + γB ϕ ϕ ϕ LD (h) − LS (h) ≤ K1 LS (h) n ) ( √ 2 )t 2 )t (B + γB (B + γB ϕ ϕ ϕ LD (h) − LS (h) ≤ K2 + LD (h) . n n (B + γB 2 )t + n √ where t = log(1/δ) + log3 n and K1 and K2 are universal constants. ˜ √n) The bound stated in this lemma is optimistic in the sense that it reduces to O(1/ ˜ when the problem is difficult and be better when the problem is easier, approaching O(1/n) ϕ,∗ for linearly separable data, i.e., LD = 0 in the second inequality. These two lemmas essentially enable us to transform a bound on the optimization error and generalization bound into a bound on the convex excess risk. In particular, by combining Lemma 4.9 with Lemma 4.10, we have the following theorem that bounds the smooth convex excess risk Eϕ (h) = LD (h) − LD (h∗λ ) for the empirical convex risk minimizer. ϕ ϕ Theorem 4.11. Let h be the solution output from updating rules in (4.4) after k iterations. Then, with a probability at least 1 − δ, we have γB 2 Eϕ (h) ≤ +K (k + 2)2 ( (B + γB 2 )t + n √ 2 ϕ,∗ (B + γB )t LD + n 119 √ γB 2 (B + γB 2 )t (k + 2)2 n ) ϕ,∗ ϕ where K is a universal constant, t = log(1/δ) + log3 n, and LD = min∥h∥ ≤B LD (h). Hκ Since our overall interest is to understand how the smoothing parameter γ affects the convergence rate of excess risk in terms of n, the number of training examples, it is better to parametrize both the number of iterations k and smoothing parameter γ in n, and bound the Eϕ (h) only in terms of n. This is given in the following corollary. Corollary 4.12. Assume γ ≥ 1 and B ≥ 1. Paramertize k and γ in terms of n as k + 2 = nα/2 and γ = nβ . 
Then, with a probability at least 1 − δ, ) ( ϕ,∗ Eϕ (h) ≤ C(B, t) nβ−α + nβ−1 + nβ−(α+1)/2 + [LD ]1/2 n(β−1)/2 (4.5) where C(B, t) is a constant depending on both B and t with t = log(1/δ) + log3 n. ϕ,∗ ϕ,∗ The bound given in (4.5) depends on LD . We would like to further characterize LD in terms of γ. First, we have 1 max max(0, 1 − z) + R(α) γ α∈[0,1] 1 log 2 ≤ max max(0, 1 − z) + log 2 = ϕHinge (z) + , γ γ α∈[0,1] ϕ(z; γ) = where ϕHinge (z) = max(0, 1 − z) is the hinge loss. As a result, we have ϕ,∗ Hinge,∗ LD ≤ LD Hinge,∗ where LD = min ∥h∥H ≤B κ + log 2 γ [ ] E(x,y)∼D ϕHinge (yh(x)) is the optimal risk with respect to the 120 hinge loss. In general, we will assume ϕ,∗ Hinge,∗ LD ≤ LD a + 1+ξ γ (4.6) ϕ,∗ Hinge,∗ where a > 0 is a constant and ξ ≥ 0 characterizes how fast LD will converge to LD with increasing γ. To see why the assumption in (4.6) is sensible, consider the case when the optimal classifier h∗Hinge = arg min∥h∥ ≤B LD Hκ Hinge (h) can perfectly classify all the data points with margin ϵ, in which we have ϕ,∗ Hinge,∗ LD ≤ LD +O ( −ϵγ ) e γ which satisfy the condition in (4.6) with arbitrarily large ξ. It is easy to verify that the condition (4.6) holds with ξ > 0 if h∗Hinge can perfectly classify O(1 − γ −1−ξ ) percentage of data with margin ϵ. Using the assumption in (4.6), we have the following result that characterizes the smooth Hinge,∗ convex excess risk bound Eϕ (h) stated in terms of the parameters α, δ and LD Theorem 4.13. Assume α ≥ 1/2. Set β as β= min(1/2, α − 1/2) . 1+ξ With a probability 1 − δ, we have Eϕ (h) ≤ O(n−τ1 + [LD Hinge,∗ 1/2 −τ2 ] n ) 121 . where τ1 = 1 + 2ξ min(1, α) , 2(1 + ξ) τ2 = 1/2 + ξ 2(1 + ξ) ϕ,∗ Proof. Replacing LD in Corollary 4.12 with the expression in (4.6), we have, with a probability 1 − δ, ( Hinge,∗ 1/2 (β−1)/2 Eϕ (h) ≤ C(R, t, a) nβ−α + nβ−1 + nβ−(α+1)/2 + [LD ] n + n−1/2−ξβ We first consider the case when α > 1. In this case, we have ( Hinge,∗ 1/2 (β−1)/2 Eϕ (h) ≤ O nβ−1 + n−1/2−ξβ + [LD ] n ) 1/2 By choosing β − 1 = −1/2 − ξβ, we have β = 1+ξ and Eϕ (h) ≤ O(n−(1/2+ξ)/(1+ξ) + [LD Hinge,∗ 1/2 −(1/2+ξ)/[2(1+ξ)] ] n In the second case, we have α ∈ [1/2, 1]. Hence we have ( ) Hinge,∗ 1/2 (β−1)/2 Eϕ (h) ≤ O nβ−α + [LD ] n + n−1/2−ξβ α−1/2 By setting β − α = −1/2 − ξβ, we have β = 1+ξ and ( Eϕ (h) ≤ O ξα+1/2 Hinge,∗ 1/2 −(1/2+ξ)/[2(1+ξ)] n 1+ξ + [LD ] n − We complete the proof by combining the results for the two cases. 122 ) . ) 4.4.2 Bounding Binary Excess Risk E(h) We now combine the results from Theorem 4.8 and Corollary 4.12 to bound E(h). Theorem 4.14. Assume α ≥ 1/2. For a failure probability δ ∈ (0, 1), define n0 as ( n0 ≤ K3 (B, δ) 1 )1/(2τ −2τ ) 1 2 Hinge,∗ LD where K3 (B, δ) is a constant depending on B and δ, and τ1 and τ2 are defined in Theorem 4.13. Set β as that in Theorem 4.13 if n ≤ n0 and 0, otherwise. Then, with a probability 1 − δ, we have    K4 (B, δ)n−τ1 log n n ≤ n0 Eϕ ≤   K (B, δ)n−1/2 log n n > n 5 0 where K4 (B, δ) and K5 (B, δ) are constants depending on B and δ. Theorem 4.14 follows from Theorem 4.8 and similar analysis for Theorem 4.13, from which we have ) ) ( ( ϕ ϕ,∗ E(h) = LD (h) − L∗D ≤ O min γ −1 , LD (h) − LD log n Remark 4.15. According to Theorem 4.14, when the number of training examples n is not too large, for the binary excess risk of empirical minimizer we have, with a high probability, E(h) ≤ O(n−τ1 log n). In the case when ξ > 0 and α > 1/2 (i.e. 
when the number of optimization iterations is 123 larger than √ ϕ,∗ Hinge,∗ n and LD converges to LD faster than 1/γ), we have τ1 > 1/2, implying that using a smooth convex loss will lead to a generalization error bound better than O(n−1/2 ) when the number of training examples is limited. This implies that for smooth loss function to achieve a binary excess error to the extent which is achievable by corresponding non-smooth loss we can run the first order optimization method for a less number of iterations. This is because our result examines the binary excess risk by taking into account the optimization complexity. We also note 1/(2τ1 − 2τ2 ) is given by 1+ξ 1 = 2τ1 − 2τ2 1/2 + ξ min(1, 2α − 1) Hinge,∗ −2 ] , which could be a large number when When α ≤ 3/4, we have n0 ≥ K3 (B, δ)[LD Hinge,∗ LD is very small. 4.5 Proofs of Statistical Consistency 4.5.1 Proof of Theorem 4.5 We first compute z = arg min z′ 1+η 1−η ϕ(z ′ ; γ) + ϕ(−z ′ ; γ) 2 2 By setting the derivative to be zero, we have 1+η 1−η = 1 + exp(−γ(1 − z)) 1 + exp(−γ(1 + z)) 124 and therefore (1 + η) exp(−γz) − (1 − η) exp(γz) + 2η exp(γ) = 0. Solving the equation, we obtain exp(−γz) = −η exp(γ) + and exp(γz) = η exp(γ) + √ η 2 exp(2γ) + (1 − η 2 ) 1+η √ η 2 exp(2γ) + (1 − η 2 ) . 1−η It is easy to verify that sgn(z) = sgn(η). This is because if η > 0, we have √ √ 1 − η2 1−η exp(−γz) ≤ = <1 1+η 1+η and therefore z > 0. On the other hand, when η < 0, we have exp(γz) = 1+η √ √ ≤ −η exp(γ) + η 2 exp(2γ) + (1 − η 2 ) 1+η < 1, 1−η and therefore z < 0. Using the solution for z, we compute ϕ(η) as 1+η 1−η 1+η 1−η ϕ(z; γ) + ϕ(z; γ) − min ϕ(z; γ) + ϕ(z; γ) z 2 2 2 2 1 + exp(γ(1 − z)) 1 − η 1 + exp(γ(1 + z)) 1+η log − log . = − 2γ 1 + exp(γ) 2γ 1 + exp(γ) ψ(η; γ) = 125 By defining constants C1 = −ηeγ + √ η 2 e2γ + 1 − η 2 and C2 = ηeγ + √ η 2 e2γ + 1 − η 2 , we can rewrite the transform function ψ(η; γ) as 1+η ψ(η; γ) = − log 2γ 4.5.2 ( [ ]) ( [ ]) C1 C2 1 1−η 1 γ γ 1+e − log 1+e . 1 + eγ 1+η 2γ 1 + eγ 1−η Proof of Theorem 4.6 Since the expression for ψ(η; γ) is symmetric in terms η, we will only consider the case when η > 0. First, we have 1−η C1 eγ 1−η √ ≤ = . 1+η 2η η + η 2 + (1 − η 2 )e−2γ Similarly, we have C2 eγ eγ = 1−η 1−η ) ( √ 1 + η 2γ γ ηe + η 2 e2γ + 1 − η 2 ≤ e 1−η Thus, we have ( ) ( ) 1−η 1+η γ 1+η 1+η 1−η γ log(1 + e ) − log − log e ψ(η; γ) ≥ 2γ 2γ 2η 2γ 1−η ( ) ( ) 1+η 1−η 1+η 1−η ≥ η− log − log 2γ 2η 2γ 1−η ( ( ) ) 2 1−η 1+η 1 1 η 1 1 + = η − log + + ≥ η − log γ 4η 2 γ 4η 4 2 where the last inequality follows from the concaveness of log(·) function. As a result when η ∈ (−1, 1) we have η 1 1 1 + + ≤ , 4η 4 2 η 126 which completes the proof. 4.5.3 Proof of Theorem 4.11 Applying Lemmas 4.9 and 4.10 to the solution to the empirical convex risk minimizer h, we have ( ) √ 2 )t 2 )t (B + γB (B + γB ϕ ϕ ϕ LD (h) ≤ LS (h) + K1 + LS (h) (4.7) n n √ ( ) √ 2 2 2 γB (B + γB )t γB 2 (B + γB 2 )t ϕ ∗ ϕ ∗ (B + γB )t ≤ LS (hγ ) + + K1 + LS (hγ ) + n n (k + 2)2 (k + 2)2 n On the other hand, by the application of the Bernstein’s inequality [30], with probability at least 1 − δ we have [( 4B log 1δ ϕ ϕ LS (h∗γ ) − LD (h∗γ ) ≤ + n 4B log 1δ ≤ + n 4E(x,y)∼D √ ] )2 ϕ ϕ(yh∗γ (x); γ) − LD (h∗γ ) log 1δ n (4.8) ϕ 8BLD (h∗γ ) log 1δ . n We conclude the proof by plugging in (7.7) with (7.6), replacing the constants with a new universal constant K, and noting that t = log 1δ + log3 n . 
4.6 Summary

In this chapter we have investigated how the smoothness of the loss function used as the surrogate of the 0-1 loss in empirical risk minimization affects the binary excess risk. While the relation between the convex excess risk and the binary excess risk had previously been established under the weakest possible conditions, such as differentiability, it was not immediately obvious how the smoothness of a convex surrogate translates into statistical consequences. This chapter takes a first step towards understanding this effect. In particular, in contrast to the optimization and generalization analyses that favor smooth surrogate losses, our results revealed that smoothness degrades the binary excess risk. To identify conditions under which smoothness is a desirable property, we proposed a unified analysis that integrates the errors in optimization, generalization, and translating the convex excess risk into the binary excess risk. Our result shows that under favorable conditions and with an appropriate choice of the smoothing parameter, a smoothed hinge loss can achieve a binary excess risk that is better than O(1/√n).

Chapter 5 Regret Bounded by Gradual Variation

The focus so far in this thesis has been on statistical learning, where we assumed that the learner is provided with a pool of i.i.d. training examples drawn from a fixed and unknown distribution D over the instance space Ξ = X × Y and is asked to output a hypothesis h ∈ H that achieves good generalization performance. This statistical assumption permits the estimation of the generalization error, and uniform convergence theory provides basic guarantees on the correctness of future predictions. We turn now to the sequential prediction setting, in which no statistical assumption is made about the sequence of observations. In particular, we consider the online convex optimization problem introduced in Chapter 2, where the ultimate goal is to devise efficient algorithms with sub-linear regret bounds, in terms of the number of rounds the game proceeds, in adversarial environments. We have seen a wide variety of algorithms, such as Follow The Perturbed Leader (FTPL) for linear and combinatorial online learning problems, and the simple Online Gradient Descent (OGD), Follow The Regularized Leader (FTRL), and Online Mirror Descent (OMD) algorithms for general convex functions, which attain O(√T) and O(log T) regret bounds for Lipschitz continuous and strongly convex functions, respectively. Most previous works, including those discussed above, considered the most general setting, in which the loss functions could be arbitrary and possibly chosen in an adversarial way. However, the environments around us may not always be fully adversarial, and the loss functions may have some patterns which can be exploited to achieve a smaller regret. For example, the weather condition or the stock price at one moment may have some correlation with the next, and their difference is usually small, while abrupt changes occur only sporadically. Consequently, one may object that requiring an algorithm to have a small regret for all sequences leads to results that are too loose to be practically interesting, and the bounds obtained for worst-case scenarios become pessimistic for these regular sequences. Recently, it has been shown that the regret of the FTRL algorithm for online linear optimization can be bounded by the total variation of the cost vectors rather than the number of rounds.
This result is appealing for the scenarios where the sequence of loss functions have a pattern and are not fully adversarial. In this chapter we extend this result to general online convex optimization and introduce a new measure referred to as gradual variation to capture the variation of consecutive convex functions. We show that the total variation bound is not necessarily small when the cost functions change slowly, and the gradual variation lower bounds the total variation. To establish the main results, we discuss a lower bound on the performance of the FTRL that maintains only one sequence of solutions, and a necessary condition on smoothness of the cost functions for obtaining a gradual variation bound. We then present two novel algorithms, improved FTRL and Online Mirror Prox (OMP), that bound the regret by the gradual variation of cost functions. Unlike previous approaches that maintain a single sequence of solutions, the proposed algorithms maintain two sequences of solutions that makes it possible to achieve a gradual variation-based regret bound for online convex optimization. We also extend the main results two-fold: (i) we present a general method to obtain a gradual variation bound measured by general norms rather than the ℓ2 norm and specialize it to three online learning settings, namely online linear optimization, prediction with expert advice, and online strictly convex optimization; (ii) we develop a deterministic algorithm for online 130 bandit optimization in multipoint bandit setting based on the proposed OMP algorithm. 5.1 Variational Regret Bounds Recall that in online convex optimization problem, at each trial t, the learner is asked to predict the decision vector wt that belongs to a bounded closed convex set W ⊆ Rd ; it then receives a cost function ft : W → R+ from a family of convex functions F and incurs a cost of ft (wt ) for the submitted solution. The goal of online convex optimization is to come up with a sequence of solutions w1 , . . . , wT ∈ W that minimizes the regret, which is defined as the difference in the cost of the sequence of decisions accumulated up to the trial T made by the learner and the cost of the best fixed decision in hindsight, i.e. RegretT = T ∑ ft (wt ) − min w∈W t=1 T ∑ ft (w). (5.1) t=1 The goal of online convex optimization is to design algorithms that predict, with a small regret, the solution wt at the tth trial given the (partial) knowledge about the past cost functions f1 , f2 , · · · , ft−1 ∈ F . As already mentioned, generally most previous studies of online convex optimization bound the regret in terms of the number of trials T . In particular for general convex Lipschitz √ continuous and strongly convex functions regret bounds of O( T ) and O(log T ) have been established, respectively, which are known to minimax optimal. However, it is expected that the regret should be low in an unchanging environment or when the cost functions are somehow correlated. Ideally, the tightest rate for the regret should depend on the variance of the sequence of cost functions rather than the number of rounds T . Consequently, the 131 bounds obtained for worst case scenarios in terms of number of iterations become pessimistic for these regular sequences and too loose to be practically interesting. Therefore, it is of great interest to derive a variation-based regret bound for online convex optimization in an adversarial setting. 
Recently [69] made a substantial progress in this route and proved a variation-based regret bound for online linear optimization by tight analysis of FTRL algorithm with an appropriately chosen step size. A similar regret bound is shown in the same paper for prediction from expert advice by slightly modifying the multiplicative weights algorithm. In this chapter, we take one step further and contribute to this research direction by developing algorithms for general framework of online convex optimization with variation-based regret bounds. When all the cost functions are linear, i.e., ft (w) = ⟨ft , w⟩, where ft ∈ Rd is the cost vector in trial t, online convex optimization becomes online linear optimization. Many decision problems can be cast into online linear optimization problem, such as prediction from expert advice [37], online shortest path problem [142]. The first variation-based regret bound for online linear optimization problems in an adversarial setting has been shown in [69]. Hazan and Kale’s algorithm for online linear optimization is based on the framework of FTRL. At each trial, the decision vector wt is given by solving the following optimization problem: wt = arg min w∈W t−1 ∑ ⟨fτ , w⟩ + τ =1 1 ∥w∥22 , 2η (5.2) where ft is the cost vector received at trial t after predicting the decision wt , and η is a step 132 size. They bound the regret by the variation of cost vectors defined as VariationT = T ∑ ∥ft − µ∥22 , (5.3) t=1 where µ = 1/T ∑T t=1 ft . By assuming ∥ft ∥2 ≤ 1, ∀t and setting the value of η to η = √ min(2/ VariationT , 1/6), they showed that the regret of FTRL can be bounded by T ∑ t=1 ⟨ft , wt ⟩ − min w∈W T ∑ ⟨ft , w⟩ ≤ t=1  √   15 VariationT   150 √ VariationT ≥ 12 . √ if VariationT ≤ 12 if (5.4) From (5.4), we can see that when the variation of the cost vectors is small (less than 12), the √ regret is a constant, otherwise it is bounded by the variation O( VariationT ). This result indicates that online linear optimization in the adversarial setting is as efficient as in the stationary stochastic setting. 5.2 Gradual Variation and Necessity of Smoothness Here we introduce a new measure to characterize the efficiency of online learning algorithms in evolving environments which is termed as gradual variation. The motivation of defining gradual variation stems from two observations: one is practical and the other one is technical raised by the limitation of extending the results in [69] to general convex functions. From practical point of view, we are interested in a more general scenario, in which the environment may be evolving but in a somewhat gradual way. For example, the weather condition or the stock price at one moment may have some correlation with the next and their difference is usually small, while abrupt changes only occur sporadically. 133 Algorithm 2 Linearalized Follow The Regularized Leader for OCO 1: Input: η > 0 2: Initialize: w1 = 0 3: for t = 1, . . . , T do 4: Predict wt by t−1 ∑ 1 wt = arg min ⟨fτ , w⟩ + ∥w∥22 2η w∈W τ =1 5: Receive cost function ft (·) and incur loss ft (wt ) 6: Compute ft = ∇ft (wt ) 7: end for In order to understand the limitation of extending the results in [69], let us apply the results to general convex loss functions. This is an important problem in its own as online convex optimization generalizes online linear optimization by replacing linear cost functions with non-linear convex cost functions and covers many other sequential decision making problems. 
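A minimal sketch of the FTRL update (5.2) over the unit ball, with the variation-tuned step size of (5.4), is given below. For illustration we assume the total variation (5.3) is known in advance and that the cost vectors are drawn around a fixed direction, so that the sequence is "regular"; both are assumptions made purely to exercise the bound, not part of the original algorithm's requirements.

import numpy as np

rng = np.random.default_rng(2)

def project_unit_ball(w):
    norm = np.linalg.norm(w)
    return w if norm <= 1.0 else w / norm

def ftrl_regret(costs, eta):
    """FTRL with l2 regularization on the unit ball, as in (5.2); returns the regret (5.1)."""
    d = costs.shape[1]
    cum = np.zeros(d)
    loss = 0.0
    for f in costs:
        w = project_unit_ball(-eta * cum)   # closed form of the FTRL update on the unit ball
        loss += f @ w
        cum += f
    best = -np.linalg.norm(costs.sum(axis=0))  # loss of the best fixed point in hindsight
    return loss - best

# A "regular" sequence: cost vectors drawn around a fixed direction, so the total variation is small.
T, d = 2000, 10
mu = rng.normal(size=d)
mu /= np.linalg.norm(mu)
costs = mu + 0.05 * rng.normal(size=(T, d))
costs /= np.maximum(1.0, np.linalg.norm(costs, axis=1, keepdims=True))  # enforce ||f_t||_2 <= 1

variation = ((costs - costs.mean(axis=0)) ** 2).sum()   # total variation (5.3)
eta = min(2.0 / np.sqrt(variation), 1.0 / 6.0)          # step size suggested by (5.4), assuming
                                                        # VariationT is known in advance
print("total variation:", round(variation, 2))
print("FTRL regret    :", round(ftrl_regret(costs, eta), 2))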
For instance, it has found applications in portfolio management [6] and online classification [88]. In online portfolio management problem, an investor wants to distribute his wealth over a set of stocks without knowing the market output in advance. If we let wt denote the distribution on the stocks and rt denote the price relative vector, i.e., rt [i] denote the the ratio of the closing price of stock i on day t to the closing price on day t − 1, then an interesting function is the logarithmic growth ratio, i.e. ∑T t=1 log(⟨wt , rt ⟩), which is a concave function to be maximized. Since the results in [69] were developed for linear loss functions, a straightforward approach is to use the first order approximation for convex loss functions, i.e., ft (w) ≃ ft (wt ) + ⟨∇ft (wt ), w − wt ⟩, and replace the linear loss vector with the gradient of the loss function ft (w) at wt . The resulting algorithm is shown in Algorithm 2. Using the 134 convexity of loss function ft (w), we have T ∑ ft (wt ) − min w∈W t=1 T ∑ T T ∑ ∑ ft (w) ≤ ⟨ft , wt ⟩ − min ⟨ft , w⟩. t=1 w∈W t=1 (5.5) t=1 If we assume ∥∇ft (w)∥2 ≤ 1, ∀t ∈ [T ], ∀w ∈ W, we can apply Hazan and Kale’s variationbased bound in (5.4) to bound the regret in (5.5) by the variation of the cost functions as: VariationT = T ∑ ∥ft − µ∥22 = t=1 T ∑ t=1 T 1 ∑ ∇ft (wt ) − ∇fτ (wτ ) T τ =1 2 . (5.6) 2 To better understand VariationT in (5.6), we rewrite it as VariationT = T ∑ t=1 = 1 2T T 1 ∑ ∇fτ (wτ ) ∇ft (wt ) − T τ =1 T ∑ 2 2 ∥∇ft (wt ) − ∇fτ (wτ )∥2 t,τ =1 T T T T 1 ∑∑ 1 ∑∑ 2 ≤ ∥∇ft (wt ) − ∇ft (wτ )∥2 + ∥∇ft (wτ ) − ∇fτ (wτ )∥22 T T t=1 τ =1 t=1 τ =1 = Variation1T + Variation2T . We see that the variation VariationT is bounded by two parts: Variation1T essentially measures the smoothness of individual cost functions, while Variation2T measures the variation in the gradients of cost functions. Let us consider an easy setting when all cost functions are 135 identical. In this case, Variation2T vanishes, and VariationT is equal to Variation1T /2, i.e., VariationT = T 1 ∑ ∥∇ft (wt ) − ∇fτ (wτ )∥2 2T t,τ =1 T 1 ∑ = ∥∇ft (wt ) − ∇ft (wτ )∥2 2T t,τ =1 = Variation1T . 2 As a result, the regret of the FTRL algorithm for online convex optimization may still be √ bounded by O( T ) regardless of the smoothness of the cost functions. To address this challenge, we develop two novel algorithms for online convex optimization that bound the regret by the variation of cost functions. In particular, we would like to bound the regret of online convex optimization by the variation of cost functions defined as follows: GradualVariationT = T∑ −1 t=1 max ∥∇ft+1 (w) − ∇ft (w)∥22 . w∈W (5.7) Note that the variation in (5.7) is defined in terms of gradual difference between individual cost function to its previous one, while the variation in (5.3) is defined in terms of total difference between individual cost vectors to their mean. Therefore we refer to the variation defined in (5.7) as gradual variation, and to the variation defined in (5.3) as total variation. It is straightforward to show that when ft (w) = ⟨ft , w⟩, the gradual variation GradualVariationT is upper bounded by the total variation VariationT defined with a constant factor: T∑ −1 t=1 ∥ft+1 − ft ∥22 ≤ T∑ −1 2∥ft+1 − µ∥22 + 2∥ft − µ∥22 ≤ 4 t=1 T ∑ t=1 136 ∥ft − µ∥22 . On the other hand, we can not bound the total variation by the gradual variation up to a constant. This is verified by the following example. 
Let us assume that the adversary plays a fixed function f for the first half of the iterations and another different function g for the second half of the iterations, i.e., f1 = · · · = fT /2 = f and fT /2+1 = · · · = fT = g ̸= f . Then, in this simple scenario the total variation of the sequence of cost functions in (5.3) is given by VariationT = T ∑ t=1 T ∥ft − µ∥22 = 2 f +g 2 T f +g 2 f− + g− = Ω(T ), 2 2 2 2 2 while the gradual variation defined in (5.7) is a constant given by GradualVariationT = T∑ −1 ∥ft+1 − ft ∥22 = ∥f − g∥22 = O(1). t=1 Based on the above analysis, we claim that the regret bound by gradual variation is usually tighter than total variation. In particular, the following theorem shows a lower bound on the performance of the FTRL in terms of gradual variation. Unlike the standard setting of online learning where the FTLR achieves the optimal regret bound for Lipschitz continuous and strongly convex losses, it is not capable of achieving regret bounded by gradual variation. The result of this theorem motivates us to develop new algorithms for √ online convex optimization to achieve a gradual variation bound of O( GradualVariationT ). For the ease of exposition we use GVT to denote the gradual variation after T iterations. Theorem 5.1. The regret of FTRL is at least Ω(min(GVT , √ T )). Proof. Let f be any unit vector passing through w1 . Let s = ⌊1/η⌋, so that if we use ft = f for every t ≤ s, each such zt+1 = w1 − tηf still remains in W and thus wt+1 = zt+1 . Next, we analyze the regret by considering the following three cases depending on the range of s. 137 √ T. √ First, when s ≥ T , we choose ft = f for t from 1 to ⌊s/2⌋ and ft = 0 for the remaining Case I: s ≥ t. Clearly, the best strategy of the offline algorithm is to play w = −f . On the other hand, since the learning rate η is too small, the strategy wt played by GD, for t ≤ ⌊s/2⌋, is far away from w, so that ⟨ft , wt − w⟩ ≥ 1 − tη ≥ 1/2. Therefore, the regret is at least √ ⌊s/2⌋ (1/2) = Ω( T ). √ Case II: 0 < s < T . √ Second, when 0 < s < T , the learning rate is high enough so that FTRL may overreact to each loss vector, and we make it pay by flipping the direction of loss vectors frequently. More precisely, we use the vector f for the first s rounds so that wt+1 = w1 − tηf for any t ≤ s, but just as ws+1 moves far enough in the direction of −f , we make it pay by switching the loss vector to −f , which we continue to use for s rounds. Note that ws+1+r = ws+1−r but ∑ fs+1+r = −fs+1−r for any r ≤ s, so 2s t=1 ⟨ft , wt − w1 ⟩ = ⟨fs+1 , ws+1 − w1 ⟩ ≥ Ω(1). As w2s+1 returns back to w1 , we can see the first 2s rounds as a period, which only contributes ∥2f ∥22 = 4 to the deviation. Then we repeat the period for τ times, where τ = ⌊GVT /4⌋ if there are enough rounds, with ⌊T /(2s)⌋ ≥ ⌊GVT /4⌋, to use up the gradual variation GVT , and τ = ⌊T /(2s)⌋ otherwise. For any remaining round t, we simply choose ft = 0. As a √ result, the total regret is at least Ω(1) · τ = Ω(min{GVT /4, T /(2s)}) = Ω(min{GVT , T }). Case III: s = 0. Finally, when s = 0, the learning rate is so high that we can easily make GD pay by flipping the direction of the loss vector in each round. More precisely, by starting with f1 = −f , we can have w2 on the boundary of W, which means that if we then alternate between f and −f , the strategies FTRL plays will alternate between w3 and w2 which have a constant distance from each other. 
Then following the analysis in the second case, one can show that 138 the total regret is at least Ω(min{GVT , T }). Assumption 5.2. In this study, we assume smooth cost functions with Lipschitz continuous gradients, i.e., there exists a constant L > 0 such that ∥∇ft (w) − ∇ft (w′ )∥2 ≤ L∥w − w′ ∥2 , ∀w, w′ ∈ W, ∀t. (5.8) We would like to emphasize that our assumption about the smoothness of cost functions is necessary to achieve the variation-based bound stated in this chapter. To see this, consider the special case of f1 (w) = · · · = fT (w) = f (w). If we are able to achieve a regret bound which scales as the square roof of the gradual variation, for any sequence of convex functions, then for the special case where all the cost functions are identical, we have T ∑ t=1 implying that wT = f (wt ) ≤ min w∈W T ∑ f (w) + O(1), t=1 ∑T t=1 wt /T approaches the optimal solution at the rate of O(1/T ). √ This contradicts the lower complexity bound (i.e., O(1/ T )) for any optimization method which only uses first order information about the cost functions [119, Theorem 3.2.1] (see also Table 2.1). This analysis indicates that the smoothness assumption is necessary to attain variation based regret bound for general online convex optimization problem. We would like to emphasize the fact that this contradiction holds when only the gradient information about the cost functions is provided to the learner and the learner may be able to achieve a variation-based bound using second order information about the cost functions, which is not the focus of this chapter. 139 −ηft wt+1 = wt − ηft ˆ t = wt − ηft−1 w −ηft−1 wt = wt−1 − ηft−1 W −ηft−1 . Figure 5.1: Illustration of the main idea behind the proposed improved FTRL and online mirror prox methods to attain regret bounds in terms of gradual variation for linear loss ˆ t instead of wt to suffer less regret when the functions. The learner plays the decision w consecutive loss functions are gradually evolving. 5.3 The Improved FTRL Algorithm As mentioned earlier, the ultimate goal of this chapter is to have algorithms that can take advantage of benign sequences in gradually evolving environments and at the same time protect against the adversarial sequences. However, the impossibility result we showed in the previous section and in particular Theorem 5.1, demonstrated that the existing algorithms such as OGD and in general the family of follow the regularized leader algorithms fail to attain a regret bounded by the gradual variation of the loss functions. Motivated by this negative result, we now turn to proposing two algorithms for online convex optimization 140 that are able to attain regret bounds in terms of gradual variation. The first algorithm is an improved FTRL and the second one is based on the mirror prox method introduced in Chapter 2. One common feature shared by the two algorithms is that both of them maintain two sequences of solutions: decision vectors w1:T = (w1 , · · · , wT ) and searching vectors z1:T = (z1 , · · · , zT ) that facilitate the updates of decision vectors. Both algorithms share almost the same regret bound except for a constant factor. All of our algorithms are based on the following idea, which we illustrate using the online linear optimization problem as an example which is graphically depicted in Figure 5.1. For general linear functions, the online gradient descent algorithm is known to achieve an optimal regret, which plays in round t the point wt = ΠW (wt−1 − ηft−1 ). 
Now, if the loss functions have a small deviation, ft−1 may be close to ft , so in round t, it may be a good idea to play a point which moves further in the direction of −ft−1 as it may make its inner product with ft (which is its loss with respect to ft ) smaller. In fact, it can be shown that if one could play the point wt+1 = ΠW (wt − ηft ) in round t, a very small regret could be achieved, i.e., ∑T t=1 ⟨wt+1 , ft ⟩ − minw∈W ∑T t=1 ⟨w, ft ⟩ ≤ O(1), but in reality one does not have ft available before round t to compute wt+1 . On the other hand, if ft−1 is a good estimate of ˆ t = ΠW (wt − ηft−1 ) should be a good estimate of wt+1 too. The point w ˆt ft , the point w ˆt can actually be computed before round t since ft−1 is available, so our algorithm plays w in round t. As it will be clear later in this chapter, our algorithms for the prediction with expert advice problem and the online convex optimization problem use the same idea to be able to achieve regret bounds stated in terms of gradual variation of the sequence of losses. To facilitate the discussion, besides the variation of cost functions defined in (5.7), we 141 Algorithm 3 Improved FTRL (IFTRL) Algorithm 1: Input: η ∈ (0, 1] 2: Initialize:: z0 = 0 and f0 (w) = 0 3: for t = 1, . . . , T do 4: Predict wt by { } L 2 wt = arg min ⟨w, ∇ft−1 (zt−1 ⟩) + ∥w − zt−1 ∥2 2η w∈W 5: 6: Receive cost function ft (·) and incur loss ft (wt ) Update zt by { } t ∑ L zt = arg min ⟨w, ∇fτ (zτ −1 )⟩ + ∥w∥22 2η w∈W τ =1 7: end for define another variation, named extended gradual variation, as follows EGVT,2 (y1:T ) = T∑ −1 ∥∇ft+1 (yt ) − ∇ft (yt )∥22 ≤ ∥∇f1 (y0 )∥22 + GVT , (5.9) t=0 where f0 (w) = 0, the sequence (y0 , . . . , yT ) is either (z0 , . . . , zT ) (as in the improved FTRL) or (w0 , . . . , wT ) (as in the online mirror prox method) and the subscript 2 means the variation is defined with respect to ℓ2 norm. When all cost functions are identical, GVT becomes zero and the extended variation EGVT,2 (y1:T ) is reduced to ∥∇f1 (y0 )∥22 , a constant independent from the number of trials. In the sequel, we use the notation EGVT,2 for simplicity. Our results show that for online convex optimization with L-smooth cost functions, the regrets of the proposed algorithms can be bounded as follows T ∑ t=1 ft (wt ) − min w∈W T ∑ ft (w) ≤ O (√ ) EGVT,2 + constant. (5.10) t=1 We now turn to presenting our first algorithm, dubbed IFTRL, which is a simple modification of the FTRL algorithm and show that its regret bounded by the gradual variation. 142 The improved FTRL algorithm for online convex optimization is presented in Algorithm 3. Without loss of generality, we assume that the decision set W is contained in a unit ball B = {x ∈ Rd : ∥x∥ ≤ 1}, i.e., W ⊆ B, and 0 ∈ W. Note that in step 6, the searching vectors zt are updated according to the FTRL algorithm after receiving the cost function ft (w). To understand the updating procedure for the decision vector wt specified in step 4, we rewrite it as { } L 2 wt = arg min ft−1 (zt−1 ) + ⟨w − zt−1 , ∇ft−1 (zt−1 )⟩ + ∥w − zt−1 ∥2 . 2η w∈W (5.11) Notice that L ∥w − zt−1 ∥22 2 L ≤ ft (zt−1 ) + ⟨w − zt−1 , ∇ft (zt−1 )⟩ + ∥w − zt−1 ∥22 , 2η ft (w) ≤ ft (zt−1 ) + ⟨w − zt−1 , ∇ft (zt−1 )⟩ + where the first inequality follows the smoothness condition in (5.8) and the second inequality follows from the fact η ≤ 1. The inequality (5.12) provides an upper bound for ft (w) and therefore can be used as an approximation of ft (w) for predicting wt . 
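For concreteness, a minimal sketch of the two updates of Algorithm 3 is given below, taking W to be the unit ball so that both prox steps reduce to projected gradient-style steps; the `project_unit_ball` helper, the gradient-oracle interface, and treating η as a plain input (rather than setting it as in Theorem 5.3) are our own simplifications. The paragraph that follows explains why the surrogate gradient ∇ft−1(zt−1) appears in the prediction step.

```python
import numpy as np

def project_unit_ball(w):
    """Euclidean projection onto the unit ball B; the text assumes W is contained in B."""
    n = np.linalg.norm(w)
    return w if n <= 1.0 else w / n

def iftrl(grad_oracles, d, eta, L):
    """Sketch of Algorithm 3 (IFTRL) with W taken to be the unit ball.

    grad_oracles is a list of callables, one per round; the t-th callable returns
    the gradient of the round-t cost function at a given point.
    """
    z = np.zeros(d)          # z_0 = 0
    grad_prev = np.zeros(d)  # gradient of f_0 = 0, used to predict w_1 = z_0
    grad_sum = np.zeros(d)   # running sum of grad f_tau(z_{tau-1}) for the FTRL step
    decisions = []
    for grad in grad_oracles:
        # Step 4: play w_t, a prox step from z_{t-1} along the previous round's gradient
        w = project_unit_ball(z - (eta / L) * grad_prev)
        decisions.append(w)
        # Steps 5-6: observe f_t, evaluate grad f_t(z_{t-1}), then the FTRL update for z_t
        grad_sum += grad(z)
        z = project_unit_ball(-(eta / L) * grad_sum)
        # grad f_t(z_t) is what the next round's prediction step will use
        grad_prev = grad(z)
    return decisions
```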
However, since ∇ft (zt−1 ) is unknown before the prediction, we use ∇ft−1 (zt−1 ) as a surrogate for ∇ft (zt−1 ), leading to the updating rule in (5.11). It is this approximation that leads to the variation bound. The following theorem states the regret bound of Algorithm 3. Theorem 5.3. Let f1 , f2 , . . . , fT be a sequence of convex functions with L-Lipschitz contin{ } √ uous gradients. By setting η = min 1, L/ EGVT,2 , we have the following regret bound 143 for the IFTRL in Algorithm 3: T ∑ ft (wt ) − min t=1 w∈W T ∑ ) ( √ ft (w) ≤ max L, EGVT,2 . t=1 Remark 5.4. Comparing with the variation bound in (5.7) for the FTRL algorithm, the smoothness parameter L plays the same role as Variation1T that accounts for the smoothness of cost functions, and term EGVT,2 plays the same role as Variation2T that accounts for the variation in the cost functions. Compared to the FTRL algorithm, the key advantage of the improved FTRL algorithm is that the regret bound is reduced to a constant when the cost functions change only by a constant number of times along the horizon. Of course, the extended variation EGVT,2 may not be known apriori for setting the optimal η, we can apply the standard doubling trick [38] to obtain a bound that holds uniformly over time and is a factor at most 8 from the bound obtained with the optimal choice of η. The details are provided later in this chapter. 5.4 The Online Mirror Prox Algorithm The second algorithm we present to attain regret bounds in terms of gradual variation is based on the prox method we introduced in Chapter 2 for non-smooth convex optimization. We generalize the prox method for online convex optimization that shares the same order of regret bound as the improved FTRL algorithm. The detailed steps of the Online Mirror Prox (OMP) method are shown in Algorithm 4, where we use an equivalent form of updates for wt and zt in order to compare to Algorithm 3 . The OMP method is closely related to the prox method in [117] by maintaining two sets of vectors w1:T and z1:T , where wt and 144 Algorithm 4 Online Mirror Prox (OMP) Algorithm 1: Input: η > 0 2: Initialize:: z0 = w0 = 0 and f0 (w) = 0 3: for t = 1, . . . , T do 4: Predict wt by { } L 2 wt = arg min ⟨w, ∇ft−1 (wt−1 )⟩ + ∥w − zt−1 ∥2 2η w∈W 5: 6: Receive cost function ft (·) and incur loss ft (wt ) Update zt by } { L 2 zt = arg min ⟨w, ∇ft (wt )⟩ + ∥w − zt−1 ∥2 2η w∈W 7: end for zt are computed by gradient mappings using ∇ft−1 (wt−1 ), and ∇ft (wt ), respectively, as ( 1 w − zt−1 − w∈W 2 ( 1 zt = arg min w − zt−1 − w∈W 2 wt = arg min ) 2 η ∇ft−1 (wt−1 ) L 2 ) 2 η ∇ft (wt ) L 2 The OMP differs from the IFTRL algorithm: (i) in updating the searching points zt , Algorithm 3 updates zt by the FTRL scheme using all the gradients of the cost functions at {zτ }t−1 τ =1 , while OMP updates zt by a prox method using a single gradient ∇ft (wt ), and (ii) in updating the decision vector wt , OMP uses the gradient ∇ft−1 (wt−1 ) instead of ∇ft−1 (zt−1 ). The advantage of OMP algorithm compared to the IFTRL algorithm is that it only requires to compute one gradient ∇ft (wt ) for each loss function; in contrast, the improved FTRL algorithm in Algorithm 3 needs to compute the gradients of ft (w) at two searching points zt and zt−1 . It is these differences that make it easier to extend the OMP to a bandit setting, which will be discussed in Section 5.7. The following theorem states the regret bound of the online mirror prox method for online convex optimization. 
145 Algorithm 5 Online Mirror Prox Method for General Norms 1: Input: η > 0, Φ(z) 2: Initialize:: z0 = w0 = minz∈W Φ(z) and f0 (w) = 0 3: for t = 1, . . . , T do 4: Predict wt by { } L wt = arg min ⟨w, ∇ft−1 (wt−1 )⟩ + B(w, zt−1 ) η w∈W 5: 6: Receive cost function ft (·) and incur loss ft (wt ) Update zt by { } L zt = arg min ⟨w, ∇ft (wt )⟩ + B(w, zt−1 ) η w∈W 7: end for Theorem 5.5. Let f1 , f2 , . . . , fT be a sequence of convex functions with L-Lipschitz contin{ √ } √ uous gradients. By setting η = (1/2) min 1/ 2, L/ EGVT,2 , we have the following regret bound for OMP in Algorithm 4 T ∑ ft (wt ) − min t=1 w∈W T ∑ ft (w) ≤ 2 max (√ ) √ 2L, EGVT,2 . t=1 We note that compared to Theorem 5.3, the regret bound in Theorem 5.5 is slightly worse by a factor of 2. 5.4.1 Online Mirror Prox Method with General Norms In this subsection, we first present a general OMP method to obtain a variation bound defined in a general norm. Then we discuss three special cases: online linear optimization, prediction with expert advice, and online strictly convex optimization. The omitted proofs in this subsection can be easily duplicated by mimicking the proof of Theorem 5.5, if necessary with the help of previous analysis as mentioned in the appropriate text. To adapt OMP to general norms other than the Euclidean norm, let ∥ · ∥ denote a general 146 norm, ∥ · ∥∗ denote its dual norm, Φ(z) be a α-strongly convex function with respect to the ( ) norm ∥ · ∥, and B(w, z) = Φ(w) − Φ(z) + ⟨w − z, Φ′ (z)⟩ be the Bregman distance induced by function Φ(w). Let f1 , f2 , · · · , fT be a sequence of smooth functions with Lipschitz continuous gradients bounded by L with respect to norm ∥ · ∥, i.e., ∥∇ft (w) − ∇ft (w′ )∥∗ ≤ L∥w − w′ ∥. (5.12) Correspondingly, we define the extended gradual variation based on the general norm as follows: EGVT = T∑ −1 ∥∇ft+1 (wt ) − ∇ft (wt )∥2∗ . (5.13) t=0 Algorithm 5 gives the detailed steps for the general framework. We note that the key differences from Algorithm 4 are: z0 is set to minz∈W Φ(z), and the Euclidean distances in steps 4 and 6 are replaced by Bregman distances, i.e., { } L wt = arg min ⟨w, ∇ft−1 (wt−1 )⟩ + B(w, zt−1 ) , η w∈W { } L zt = arg min ⟨w, ∇ft (wt )⟩ + B(w, zt−1 ) . η w∈W (5.14) The following theorem states the variation-based regret bound for the general norm framework, where R measure the size of W defined as √ R= 2( max Φ(w) − min Φ(w)). w∈W w∈W Theorem 5.6. Let f1 , f2 , · · · , fT be a sequence of convex functions whose gradients are Lsmooth continuous, Φ(z) be a α-strongly convex function, both with respect to norm ∥ · ∥, and {√ √ } √ EGVT be defined in (5.13). By setting η = (1/2) min α/ 2, LR/ EGVT , we have the 147 following regret bound T ∑ ft (wt ) − min T ∑ w∈W t=1 ft (w) ≤ 2R max (√ ) √ √ 2LR/ α, EGVT . t=1 In the following subsections we specialize the proposed general method to few specific online learning settings. 5.4.2 Online Linear Optimization Here we consider online linear optimization and present the algorithm and the gradual variation bound for this setting as a special case of proposed algorithm. In particular, we are interested in bounding the regret by the gradual variation f EGVT,2 = T∑ −1 ∥ft+1 − ft ∥22 , t=0 where ft , t = 1, . . . , T are the linear cost vectors and f0 = 0. Since linear functions are smooth functions that satisfy the inequality in (5.8) for any positive L > 0, therefore we can apply Algorithm 4 to online linear optimization with any positive value for L 1 . 
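As a concrete special case, the sketch below instantiates Algorithm 4 for linear costs with L = 1 and W equal to the unit ball; these simplifications, the helper name, and treating η as a plain input rather than setting it from the gradual variation as in the corollary below, are ours.

```python
import numpy as np

def project_unit_ball(w):
    n = np.linalg.norm(w)
    return w if n <= 1.0 else w / n

def online_mirror_prox_linear(loss_vectors, eta):
    """Sketch of Algorithm 4 for linear costs f_t(w) = <f_t, w>, with L = 1 and W the unit ball."""
    d = len(loss_vectors[0])
    z = np.zeros(d)
    f_prev = np.zeros(d)          # f_0 = 0
    incurred = []
    for f_t in loss_vectors:
        # Step 4: predict w_t by a prox step from z_{t-1} along the *previous* loss vector
        w = project_unit_ball(z - eta * f_prev)
        incurred.append(float(f_t @ w))
        # Step 6: update the search point z_t using the *current* loss vector
        z = project_unit_ball(z - eta * f_t)
        f_prev = f_t
    return incurred
```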
The regret bound of Algorithm 4 for online linear optimization is presented in the following corollary. Corollary 5.7. Let ft (w) = ⟨ft , w⟩, t = 1, . . . , T be a sequence of linear functions. By √ ( ) setting η = f 1/ 2EGVT,2 and L = 1 in Algorithm 4, then we have T ∑ t=1 ft⊤ wt − min w∈W T ∑ √ ⟨ft , w⟩ ≤ f 2EGVT,2 . t=1 1 We simply set L = 1 for online linear optimization and prediction with expert advice. 148 Remark 5.8. Note that the regret bound in Corollary 5.7 is stronger than the regret bound obtained in [71] for online linear optimization due to the fact that the gradual variation is smaller than the total variation. 5.4.3 Prediction with Expert Advice In the problem of prediction with expert advice, the decision vector w is a distribution over m experts, i.e., w ∈ W = {w ∈ Rm + : ∑m m i=1 wi = 1}. Let ft ∈ R denote the costs for m experts in trial t. Similar to [69], we would like to bound the regret of prediction from expert advice by the gradual variation defined in infinite norm, i.e., f EGVT,∞ = T∑ −1 ∥ft+1 − ft ∥2∞ . t=0 Since it is a special online linear optimization problem, we can apply Algorithm 4 to obtain a regret bound as in Corollary 5.7, i.e., T ∑ ft⊤ wt − min w∈W t=1 T ∑ √ √ ⟨ft , w⟩ ≤ f 2EGVT,2 ≤ f 2mEGVT,∞ . t=1 However, the above regret bound scales badly with the number of experts. We can obtain √ f a better regret bound in O( EGVT,∞ ln m) by applying the general prox method in Algorithm 5 with Φ(w) = ∑m i=1 wi ln wi and B(w, z) = ∑m i=1 wi ln(zi /wi ). The two updates in Algorithm 5 become i exp([η/L]f i ) zt−1 t−1 i wt = ∑ , i = 1, . . . , m j j m j=1 zt−1 exp([η/L]ft−1 ) i exp([η/L]f i ) zt−1 t i zt = ∑ , i = 1, . . . , m. j j m j=1 zt−1 exp([η/L]ft ) 149 (5.15) The resulting regret bound is formally stated in the following Corollary. Corollary 5.9. Let ft (w) = ⟨ft , w⟩, t = 1, . . . , T be a sequence of linear functions in pre√ ∑ f diction with expert advice. By setting η = (ln m)/EGVT,∞ , L = 1, Φ(w) = m i=1 wi ln wi and B(w, z) = ∑m i=1 wi ln(wi /zi ) in Algorithm 5, we have √ T T ∑ ∑ f ⟨ft , wt ⟩ − min ⟨ft , w⟩ ≤ 2EGVT,∞ ln m. w∈W t=1 t=1 By noting the definition of EGVfT,∞ , the regret bound in Corollary 5.9 is  O T∑ −1 t=0  i − f i | ln m , max |ft+1 t i which is similar to the regret bound obtained in [69] for prediction with expert advice. However, the definitions of the variation are not exactly the same. In [69], the authors bound the (√ ) ∑T i i 2 regret of prediction with expert advice by O ln m maxi t=1 |ft − µt | + ln m , where the variation is the maximum total variation over all experts. To compare the two regret bounds, we first consider two extreme cases. When the costs of all experts are the same, then the variation in Corollary 5.9 is a standard gradual variation, while the variation in [69] is a standard total variation. According to the previous analysis, a gradual variation is smaller than a total variation, therefore the regret bound in Corollary 5.9 is better than that in [69]. In another extreme case when the costs at all iterations of each expert are the same, both regret bounds are constants. More generally, if we assume the maximum total variation is small (say a constant), then By a trivial analysis ∑T −1 i i t=0 |ft+1 − ft | is also a constant for any i ∈ [m]. ∑T −1 ∑T −1 i i i i t=0 maxi |ft+1 − ft | ≤ m maxi t=0 |ft+1 − ft |, the regret bound in Corollary 5.9 might be worse up to a factor √ 150 m than that in [69]. Remark 5.10. 
It was shown in [41], both the regret bounds in Corollary 5.7 and Corollary 5.9 are optimal because they match the lower bounds for a special sequence of loss functions. In particular, for online linear optimization if all loss functions but the first √ √ f Tk = EGVT,2 are all-0 functions, then the known lower bound Ω( Tk ) matches the upper bound in Corollary 5.7. Similarly, for prediction from expert advice if all loss functions but √ √ f ′ the first Tk = EGVT,∞ are all-0 functions, then the known lower bound Ω( Tk′ ln m) [38] matches the upper bound in Corollary 5.9. 5.4.4 Online Strictly Convex Optimization In this subsection, we present an algorithm to achieve a logarithmic variation bound for online strictly convex optimization. In particular, we assume the cost functions ft (w) are not only smooth but also strictly convex defined formally in the following. Definition 5.11. For β > 0, a function f (w) : W → R is β-strictly convex if for any w, z ∈ W f (w) ≥ f (z) + ∇⟨f (z), w − z⟩ + β(w − z)⊤ ∇f (z)∇f (z)⊤ (w − z) (5.16) It is known that such a defined strictly convex function include strongly convex function and exponential concave function as special cases as long as the gradient of the function is bounded. To see this, if f (w) is a β ′ -strongly convex function with a bounded gradient ∥∇f (w)∥2 ≤ G, then f (w) ≥ f (z) + ⟨∇f (z), w − z⟩ + β ′ ⟨w − z, w − z⟩ β′ ≥ f (z) + ⟨∇f (z), w − z⟩ + 2 (w − z)⊤ ∇f (z)∇f (z)⊤ (w − z), G 151 (5.17) thus f (w) is a (β ′ /G2 ) strictly convex. Similarly if f (w) is exp-concave, i.e., there exists α > 0 such that h(w) = exp(−αf (w)) is concave, then f (w) is a β = 1/2 min(1/(4GD), α) strictly convex (c.f. Lemma 2 in [68]), where D is defined as the diameter of the domain. Therefore, in addition to smoothness and strict convexity we also assume all the cost functions have bounded gradients, i.e., ∥∇ft (w)∥2 ≤ G. We now turn to deriving a logarithmic gradual variation bound for online strictly convex optimization. To this end, we need to change the Euclidean distance function in Algorithm 4 to a generalized Euclidean distance function. Specifically, at trial t, we let Ht = I + βG2 I + β ∑t−1 1 ⊤ τ =0 ∇fτ (wτ )∇fτ (wτ ) and use the generalized Euclidean distance Bt (w, z) = 2 ∥w − z∥2H = 21 (w − z)⊤ Ht (w − z) in updating wt and zt , i.e., t { } 1 2 wt = arg min ⟨w, ∇ft−1 (wt−1 )⟩ + ∥w − zt−1 ∥H t 2 w∈W { } 1 zt = arg min ⟨w, ∇ft (wt )⟩ + ∥w − zt−1 ∥2H , t 2 w∈W (5.18) To prove the regret bound, we can prove a similar inequality as in Lemma 5.16 by applying Φ(w) = 1/2∥w∥2H , which is stated as follows t ∇ft (wt )⊤ (wt − z) ≤ Bt (z, zt−1 ) − Bt (z, zt ) ] 1[ 2 2 2 ∥wt − zt−1 ∥H + ∥wt − zt ∥H . + ∥∇ft (wt ) − ∇ft−1 (wt−1 )∥ −1 − t t Ht 2 Then by applying inequality in (5.16) for strictly convex functions, we obtain the following 152 ft (wt ) − ft (z) ≤ Bt (w, zt−1 ) − Bt (w, zt ) − β∥wt − z∥2M t ] [ 1 + ∥∇ft (wt ) − ∇ft−1 (wt−1 )∥2 −1 − ∥wt − zt−1 ∥2H + ∥wt − zt ∥2H , t t Ht 2 where Mt = ∇ft (wt )∇ft (wt )⊤ and Ht = I + βG2 I + β (5.19) ∑t−1 τ =0 Mτ as defined above. The following corollary shows that general OMP method attains a logarithmic gradual variation bound and its proof is deferred to later. Corollary 5.12. Let f1 , f2 , . . . , fT be a sequence of β-strictly convex and L-smooth functions √ with gradients bounded by G. We assume 8dL2 ≥ 1, otherwise we can set L = 1/(8d). 
An algorithm that adopts the updates in (5.18) has a regret bounded by T ∑ ft (wt ) − min w∈W t=1 where EGVT,2 = 5.4.5 T ∑ t=1 ft (w) ≤ 1 + βG2 8d + ln max(16dL2 , βEGVT,2 ), 2 β ∑T −1 2 t=0 ∥∇ft+1 (wt ) − ∇ft (wt )∥2 and d is the dimension of w ∈ W. Gradual Variation Bounds which Hold Uniformly over Time As mentioned in Remark 5.4, the algorithms presented in this chapter rely on the previous knowledge of the gradual variation EGVT,2 to tune the learning rate η to obtain the optimal bound. Here, we show that the Algorithm 3 can be used as a black-box to achieve the same regret bound but without any prior knowledge of the EGVT,2 . We note that the analysis here is not specific to Algorithm 3 and it is general enough to be adapted to other algorithms in the chapter too. The main idea is to run the algorithm in epochs with a fixed learning rate ηk = η0 /2k for 153 kth epoch where η0 is a fixed constant and will be decided by analysis. We denote the number of epochs by K and let bk denote the start of kth epoch. We note that bK+1 = T +1. Within kth epoch, the algorithm ensures that the inequality ηk ∑bk+1 −1 t=bk ∥∇ft+1 (zt ) − ∇ft (zt )∥22 ≤ L2 ηk−1 holds. To this end, the algorithm computes and maintains the quantity t ∑ ∥∇fs+1 (zs ) − ∇fs (zs )∥22 s=bk and sets the beginning of new epoch to be bk+1 = min t t ∑ ∥∇fs+1 (zs ) − ∇fs (zs )∥22 > L2 ηk−1 , s=bk i.e., the first iteration for which the invariant is violated. We note that this decision can only be made after seeing the tth cost function. Therefor, we burn the first iteration of each epoch which causes an extra regret of KL2 in the total regret. From the analysis we have: T ∑ t=1 ft (wt ) − min w∈W T ∑ ft (w) ≤ t=1 K ∑ k=1 ≤ ≤ b  −1 k+1 ∑ t=bk ft (wt ) − min bk+1 −1 ∑ w∈W  ft (w) t=bk K ∑ η L + k EGVb :b + KL2 k k+1 −1 2ηk 2L k=1 K ∑ k=1 ∑ L L + + K = 2L ηk−1 + KL2 2ηk 2ηk K k=1 where the first inequality follows the analysis of algorithm for each epoch, the last inequality follows the invariant maintained within each phase and the constant KL2 is due to burning the first iteration of each epoch. We now try to upper bound the last term. We first note that ∑K−1 −1 k −1 K −1 K −1 K −1 k=1 ηk = k=1 η0 2 + η0 2 = η0 (2 − 1) + η0 2 ≤ ∑K 154 η0−1 2K+1 . Furthermore, from bK , we know that ηK−1 ∑bK 2 t=bK−1 ∥∇ft+1 (zt ) − ∇ft (zt )∥2 ≥ −1 since the b is the first iteration within epoch K − 1 which violates the invariL2 ηK−1 K ant. Also, from the monotonicity of gradual variation one can obtain that ηK−1 EGVT,2 ≥ ∑b K −1 which indicates η −1 ≤ √EGV ∥∇ft+1 (zt ) − ∇ft (zt )∥22 ≥ L2 ηK−1 ηK−1 t=b T,2 /L. K−1 K−1 Putting these together, from (5.20) we obtain: T ∑ t=1 ft (wt ) − min w∈W T ∑ t=1 ft (w) ≤ 2L √ 2K+1 + KL2 ≤ 8 EGVT,2 + KL2 . η0 (5.20) It remains to bound the number of epochs in terms of EGVT,2 . A simple idea would be to set K to be ⌊log EGVT,2 ⌋ + 1, since it is the maximum number of epochs that could exists. Alternatively, we can also bound K in terms of √ EGVT,2 which worsen the constant factor in the bound but results in a bound similar to one obtained by setting optimal η. 5.5 Bandit Online Mirror Prox with Gradual Variation Bounds Online convex optimization becomes more challenging when the learner only receives partial feedback about the cost functions. One common scenario of partial feedback is that the learner only receives the cost ft (wt ) at the predicted point wt but without observing the entire cost function ft (·). 
This setup is usually referred to as bandit setting, and the related online learning problem is called online bandit convex optimization. Recently Hazan et al [70] extended the FTRL algorithm to online bandit linear optimiza√ tion and obtained a variation-based regret bound in the form of O(poly(d) VariationT log(T )+ 155 poly(d log(T ))), where VariationT is the total variation of the cost vectors. We continue this line of work by proposing algorithms for general online bandit convex optimization with a variation-based regret bound. We present a deterministic algorithm for online bandit convex optimization by extending the OPM algorithm to a multi-point bandit setting, and prove the variation-based regret bound, which is optimal when the variation is independent of the number of trials. In our bandit setting , we assume we are allowed to query d + 1 points around the decision point wt . To develop a variation bound for online bandit convex optimization, we follow [5] by considering the multi-point bandit setting, where at each trial the player is allowed to query the cost functions at multiple points. We propose a deterministic algorithm to compete against the completely adaptive adversary that can choose the cost function ft (w) with the knowledge of w1 , · · · , wt . To approximate the gradient ∇ft (wt ), we query the cost function to obtain the cost values at ft (wt ), and ft (wt +δei ), i = 1, · · · , d, where ei is the ith standard base in Rd . Then we compute the estimate of the gradient ∇ft (wt ) by 1∑ gt = (ft (wt + δei ) − ft (wt )) ei . δ d (5.21) i=1 It can be shown that [5], under the smoothness assumption in (5.8), √ ∥gt − ∇ft (wt )∥2 ≤ dLδ . 2 (5.22) To prove the regret bound, besides the smoothness assumption of the cost functions, and the boundness assumption about the domain W ⊆ B, we further assume that (i) there exists r ≤ 1 such that rB ⊆ W ⊆ B, and (ii) the cost function themselves are Lipschitz continuous, i.e., there exists a constant G such that 156 Algorithm 6 Deterministic Online Bandit Convex Optimization 1: Input: η, α, δ > 0 2: Initialize:: z0 = 0 and f0 (w) = 0 3: for t = 1, . . . , T do 4: Compute wt by { } G 2 wt = arg min ⟨w, gt−1 ⟩ + ∥w − zt−1 ∥2 2η w∈(1−α)W 5: 6: Observe ft (wt ), ft (wt + δei ), i = 1, · · · , d Update zt by { G zt = arg min ⟨w, gt ⟩ + ∥w − zt−1 ∥22 2η w∈(1−α)W } 7: end for |ft (w) − ft (w′ )| ≤ G∥w − w′ ∥2 , ∀w, w′ ∈ W, ∀t. (5.23) For our purpose, we define another gradual variation of cost functions by EGVcT = T∑ −1 t=0 max |ft+1 (w) − ft (w)|. w∈W (5.24) Unlike the gradual variation defined in (5.9) that uses the gradient of the cost functions, the gradual variation in (5.24) is defined according to the values of cost functions. The reason why we bound the regret by the gradual variation defined in (5.24) by the values of the cost functions rather than the one defined in (5.9) by the gradient of the cost functions is that in the bandit setting, we only have point evaluations of the cost functions. The following theorem states the regret bound for Algorithm 6. Theorem 5.13. Let ft (·), t = 1, . . . , T be a sequence of G-Lipschitz √ continuous convex func√ √ 4d max( 2G, EGVcT ) √ , tions, and their gradients are L-Lipschitz continuous. By setting δ = ( dL + G(1 + 1/r))T { } δ 1 G η = min √ , √ , and α = δ/r, we have the following regret bound for Algo4d EGVcT 2 157 rithm 6 T ∑ ft (wt ) − min T ∑ w∈W t=1 √ ft (w) ≤ 4 max √ (√ ) 2G, EGVcT d (dL + G/r) T. (5.25) t=1 Remark 5.14. 
Similar to the regret bound in [5](Theorem 9), Algorithm 6 also gives the √ optimal regret bound O( T ) when the variation is independent of the number of trials. Our regret bound has a better dependence on d (i.e., d) compared with the regret bound in [5] (i.e., d2 ). 5.6 5.6.1 Proofs of Gradual Variation Proof of Theorem 5.3 To prove Theorem 5.3, we first present the following lemma. Lemma 5.15. Let f1 , f2 , . . . , fT be a sequence of convex functions with L-Lipschitz continuous gradients. By running Algorithm 3 over T trials, we have T ∑ t=1  L ft (wt ) ≤ min  ∥w∥22 + w∈W 2η T ∑  ft (zt−1 ) + ⟨w − zt−1 , ∇ft (zt−1 )⟩ t=1 T −1 η ∑ + ∥∇ft+1 (zt ) − ∇ft (zt )∥22 . 2L t=0 With this lemma, we can easily prove Theorem 5.3 by exploring the convexity of ft (w). Proof of Theorem 5.3. By using ∥w∥2 ≤ 1, ∀w ∈ W ⊆ B, and the convexity of ft (w), we 158 have  L min ∥w∥22 + w∈W  2η T ∑ t=1   ∑ L ft (zt−1 ) + ⟨w − zt−1 , ∇ft (zt−1 )⟩ ≤ ft (w). + min  2η w∈W T t=1 Combining the above result with Lemma 5.15, we have T ∑ ft (wt ) − min w∈W t=1 By choosing η = min(1, L/ T ∑ ft (w) ≤ t=1 η L + EGVT,2 . 2η 2L √ EGVT,2 ), we have the regret bound claimed in Theorem 5.3. The Lemma 5.15 is proved by induction. The key to the proof is that zt is the optimal solution to the strongly convex minimization problem in Lemma 5.15, i.e., [ ∑ L zt = arg min ∥w∥22 + fτ (zτ −1 ) + ⟨w − zτ −1 , ∇fτ (zτ −1 )⟩ w∈W 2η t ] τ =1 Proof of Lemma 5.15. We prove the inequality by induction. When T = 1, we have w1 = z0 = 0 and [ ] L η 2 min ∥w∥2 + f1 (z0 ) + ⟨w − z0 , ∇f1 (z0 )⟩ + ∥∇f1 (z0 )∥22 2L w∈W 2η { } L η 2 2 ≥ f1 (z0 ) + ∥∇f1 (z0 )∥2 + min ∥w∥2 + ⟨w − z0 , ∇f1 (z0 )⟩ w 2L 2η = f1 (z0 ) = f1 (w1 ). where the inequality follows that by relaxing the minimization domain w ∈ W to the whole space. We assume the inequality holds for t and aim to prove it for t + 1. To this end, we define 159 [ ] t ∑ L ψt (w) = ∥w∥22 + fτ (zτ −1 ) + ⟨w − zτ −1 , ∇fτ (zτ −1 )⟩ 2η τ =1 + η 2L t−1 ∑ ∥∇fτ +1 (zτ ) − ∇fτ (zτ )∥22 . τ =0 According to the updating procedure for zt in step 6, we have zt = arg minw∈W ψt (w). Define ϕt = ψt (zt ) = minw∈W ψt (w). Since ψt (w) is a (L/η)-strongly convex function, we have L ∥w − zt ∥22 + ⟨w − zt , ∇ψt+1 (zt )⟩ 2η L = ∥w − zt ∥22 + ⟨w − zt , ∇ψt (zt ) + ∇ft+1 (zt )⟩. 2η ψt+1 (w) − ψt+1 (zt ) ≥ Setting w = zt+1 = arg minw∈W ψt+1 (w) in the above inequality results in ψt+1 (zt+1 ) − ψt+1 (zt ) = ϕt+1 − (ϕt + ft+1 (zt ) + η ∥∇ft+1 (zt ) − ∇ft (zt )∥22 ) 2L L ∥z − zt ∥22 + ⟨zt+1 − zt , ∇ψt (zt ) + ∇ft+1 (zt )⟩ 2η t+1 L ≥ ∥zt+1 − zt ∥22 + ⟨zt+1 − zt , ∇ft+1 (zt )⟩, 2η ≥ (5.26) where the second inequality follows from the fact zt = arg minw∈W ψt (w), and therefore (w − zt )⊤ ∇ψt (zt ) ≥ 0, ∀w ∈ W. Moving ft+1 (zt ) in the above inequality to the right hand side, we have 160 ϕt+1 − ϕt − η ∥∇ft+1 (zt ) − ∇ft (zt )∥22 2L L ∥z − zt ∥22 + ⟨zt+1 − zt , ∇ft+1 (zt )⟩ + ft+1 (zt ) 2η t+1 { } L 2 ≥ min ∥w − zt ∥2 + ⟨w − zt , ∇ft+1 (zt )⟩ + ft+1 (zt ) w∈W 2η         L  2 = min ∥w − zt ∥2 + ⟨w − zt , ∇ft (zt )⟩ +ft+1 (zt ) + ⟨w − zt , ∇ft+1 (zt ) − ∇ft (zt )⟩ .  2η w∈W        r(w) ≥ ρ(w) (5.27) To bound the right hand side, we note that wt+1 is the minimizer of ρ(w) by step 4 in Algorithm 3, and ρ(w) is a L/η-strongly convex function, so we have ρ(w) ≥ ρ(wt+1 ) + ⟨w − wt+1 , ∇ρ(wt+1 )⟩ + ≥0 L L ∥w − wt+1 ∥22 ≥ ρ(wt+1 ) + ∥w − wt+1 ∥22 . 
2η 2η Then we have L ∥w − wt+1 ∥22 + r(w) 2η L L = ∥wt+1 − zt ∥22 + ⟨wt+1 − zt , ∇ft (zt )⟩ +ft+1 (zt ) + ∥w − wt+1 ∥22 + r(w) 2η 2η ρ(w) + ft+1 (zt ) + r(w) ≥ ρ(wt+1 ) + ft+1 (zt ) + ρ(wt+1 ) Plugging above inequality into the inequality in (5.27), we have 161 ϕt+1 − ϕt − ≥ η ∥∇ft+1 (zt ) − ∇ft (zt )∥22 2L L ∥w − zt ∥22 + ⟨wt+1 − zt , ∇ft (zt ) + ft+1 (zt )⟩ 2η t+1 } { L 2 ∥w − wt+1 ∥2 + ⟨w − zt , ∇ft+1 (zt ) − ∇ft (zt )⟩ + min w∈W 2η To continue the bounding, we proceed as follows ϕt+1 − ϕt − ≥ = ≥ η ∥∇ft+1 (zt ) − ∇ft (zt )∥22 2L L ∥w − zt ∥22 + ⟨wt+1 − zt , ∇ft (zt )⟩ + ft+1 (zt ) 2η t+1 { } L 2 + min ∥w − wt+1 ∥2 + ⟨w − zt , ∇ft+1 (zt ) − ∇ft (zt )⟩ w∈W 2η L ∥w − zt ∥22 + ⟨wt+1 − zt , ∇ft+1 (zt )⟩ + ft+1 (zt ) 2η t+1 } { L 2 ∥w − wt+1 ∥2 + ⟨w − wt+1 , ∇ft+1 (zt ) − ∇ft (zt )⟩ + min w∈W 2η L ∥w − zt ∥22 + ⟨wt+1 − zt , ∇ft+1 (zt )⟩ + ft+1 (zt ) 2η t+1 } { L 2 ∥w − wt+1 ∥2 + ⟨w − wt+1 , ∇ft+1 (zt ) − ∇ft (zt )⟩ + min w 2η L η ∥wt+1 − zt ∥22 + ⟨wt+1 − zt , ∇ft+1 (zt )⟩ + ft+1 (zt ) − ∥∇ft+1 (zt ) − ∇ft (zt )∥22 2η 2L η ≥ft+1 (wt+1 ) − ∥∇ft+1 (zt ) − ∇ft (zt )∥22 , 2L = where the first equality follows by writing ⟨wt+1 − zt , ∇ft (zt )⟩ = ⟨wt+1 − zt , ∇ft+1 (zt )⟩− ⟨wt+1 − zt , ∇ft+1 (zt ) − ∇ft (zt )⟩ and combining with ⟨w − zt , ∇ft+1 (zt ) − ∇ft (zt )⟩, and the last inequality follows from the smoothness condition of ft+1 (w). Since by induction ϕt ≥ ∑t τ =1 fτ (wτ ), we have ϕt+1 ≥ ∑t+1 τ =1 fτ (wτ ). 162 5.6.2 Proof of Theorem 5.5 To prove Theorem 5.5, we need the following lemma, which is the Lemma 3.1 in [117] stated in our notations. Lemma 5.16 (Lemma 3.1 [117]). Let Φ(z) be a α-strongly convex function with respect to the norm ∥ · ∥, whose dual norm is denoted by ∥ · ∥∗ , and B(w, z) = Φ(w) − (Φ(z) + (w − z)⊤ Φ′ (z)) be the Bregman distance induced by function Φ(w). Let Z be a convex compact set, and U ⊆ Z be convex and closed. Let z ∈ Z, γ > 0, Consider the points, w = arg min γu⊤ ξ + B(u, z), (5.28) z+ = arg min γu⊤ ζ + B(u, z), (5.29) u∈U u∈U then for any u ∈ U , we have γζ ⊤ (w − u) ≤ B(u, z) − B(u, z+ ) + γ2 α ∥ξ − ζ∥2∗ − [∥w − z∥2 + ∥w − z+ ∥2 ]. α 2 In order not to have readers struggle with complex notations in [117] for the proof of Lemma 5.16, we present a detailed proof later in Appendix A.3 which is an adaption of the original proof to our notations. Theorem 5.5 can be proved by using the above lemma, because the updates of wt , zt can be written equivalently as (5.28). The proof below starts from (5.16) and bounds the summation of each term over t = 1, . . . , T , respectively. Proof of Theorem 5.5. First, we note that the two updates in step 4 and step 6 of Algorithm 4 fit in the Lemma 5.16 if we let U = Z = W, z = zt−1 , w = wt , z+ = zt , and Φ(w) = 21 ∥w∥22 , 163 which is 1-strongly convex function with respect to ∥ · ∥2 . Then B(u, z) = 12 ∥u − z∥22 . As a result, the two updates for wt , zt in Algorithm 4 are exactly the updates in (5.28) with z = zt−1 , γ = η/L, ξ = ∇ft−1 (zt−1 ), and ζ = ∇ft (wt ). 
Replacing these into (5.16), we have the following inequality for any u ∈ W, ) η 1( ⊤ 2 2 (wt − u) ∇ft (wt ) ≤ ∥u − zt−1 ∥2 − ∥u − zt ∥2 L 2 ) 1( η2 ∥wt − zt−1 ∥22 + ∥wt − zt ∥22 + 2 ∥∇ft (wt ) − ∇ft−1 (wt−1 )∥22 − 2 L (5.30) Then we have ) η η 1( 2 2 ⊤ ∥u − z ∥ − ∥u − z ∥ (ft (wt ) − ft (u)) ≤ (wt − u) ∇ft (wt ) ≤ t 2 t−1 2 L L 2 2η 2 2η 2 + 2 ∥∇ft (wt−1 ) − ∇ft−1 (wt−1 )∥22 + 2 ∥∇ft (wt ) − ∇ft (wt−1 )∥22 L L ) ( 1 − ∥wt − zt−1 ∥22 + ∥wt − zt ∥22 2 ) 2η 2 1( 2 2 ≤ ∥u − zt−1 ∥2 − ∥u − zt ∥2 + 2 ∥∇ft (wt−1 ) − ∇ft−1 (wt−1 )∥22 2 L ) ( 1 2 2 2 2 + 2η ∥wt − wt−1 ∥2 − ∥wt − zt−1 ∥2 + ∥wt − zt ∥2 , 2 (5.31) where the first inequality follows the convexity of ft (w), and the third inequality follows the smoothness of ft (w). By taking the summation over t = 1, · · · , T with z∗ = arg min u∈W 164 ∑T t=1 ft (u), and dividing both sides by η/L, we have T ∑ ft (wt ) − min w∈W t=1 T −1 2η ∑ L ft (w) ≤ + ∥∇ft+1 (wt ) − ∇ft (wt )∥22 2η L ∑ t=1 T ∑ + t=0 2η 2 ∥wt − wt−1 ∥22 − t=1 T ∑ 1( t=1 2 ∥wt − zt−1 ∥22 + ∥wt − zt ∥22 ) ≜BT We can bound BT as follows: BT = ≥ T T +1 1∑ 1∑ ∥wt − zt−1 ∥22 + ∥wt−1 − zt−1 ∥22 2 2 1 2 1 ≥ 4 t=1 T ( ∑ t=2 T ∑ t=2 t=2 ∥wt − zt−1 ∥22 + ∥wt−1 − zt−1 ∥22 1 ∥wt − wt−1 ∥22 = 4 T ∑ ) ∥wt − wt−1 ∥22 t=1 where the last equality follows that w1 = w0 . Plugging the above bound into (5.32), we have T ∑ t=1 ft (wt ) − min w∈W ∑ ft (w) ≤ t=1 T ( ∑ 2η 2 − + t=1 T −1 L 2η ∑ + ∥∇ft+1 (wt ) − ∇ft (wt )∥22 2η L 1 4 t=0 ) ∥wt − wt−1 ∥22 We complete the proof by plugging the value of η. 5.6.3 Proof of Corollary 5.12 We first have the key inequality in (5.19): for any z ∈ W 165 ft (wt ) − ft (z) ≤ Bt (z, zt−1 ) − Bt (z, zt ) − β∥wt − z∥2M t ] [ 1 + ∥∇ft (wt ) − ∇ft−1 (wt−1 )∥2 −1 − ∥wt − zt−1 ∥2H + ∥wt − zt ∥2H . t t Ht 2 Taking summation over t = 1, . . . , T , we have T ∑ ft (wt ) − t=1 T ∑ ft (z) ≤ t=1 + T ∑ t=1 T ∑ (Bt (z, zt−1 ) − Bt (z, zt )) − t=1 T ∑ t=1 ≜At ∥∇ft (wt ) − ∇ft−1 (wt−1 )∥2 −1 − H T ∑ 1[ t t=1 2 β∥wt − z∥2M t ≜Ct ∥wt − zt−1 ∥2H + ∥wt − zt ∥2H t ] t ≜Bt ≜St Next we bound each term individually. First, T ∑ At = B1 (z, z0 ) − BT (z, zT ) + T∑ −1 (Bt+1 (z, zt ) − Bt (z, zt )) (5.32) t=1 t=1 Note that B1 (z, z0 ) = 21 (1 + βG2 )∥z∥22 ≤ 12 (1 + βG2 ) for any z ∈ W, and Bt+1 (z, zt ) − Bt (w, zt ) = β2 ∥z − zt ∥2H , therefore t T ∑ t=1 ∑β 1 At ≤ (1 + βG2 ) + ∥z − zt ∥2H t 2 2 T t=1 166 (5.33) Then T ∑ t=1 ] T [ ∑ 1 β 2 2 2 (At − Ct ) ≤ (1 + βG ) + ∥z − zt ∥M − β∥wt − z∥M t t 2 2 1 ≤ (1 + βG2 ) + 2 1 ≤ (1 + βG2 ) + 2 t=1 T [ ∑ t=1 T ∑ t=1 ] β∥z − wt ∥2M + β∥wt − zt ∥2M − β∥wt − z∥2M t t 1 β∥wt − zt ∥2M ≤ (1 + βG2 ) + t 2 t T ∑ ∥wt − zt ∥2H t=1 t (5.34) Noting the updates in (5.18) and from inequality in (A.9) in the proof of Lemma 5.5 in Appendix A.3, we can get ∥wt − zt ∥Ht ≤ ∥∇ft (wt ) − ∇ft−1 (wt−1 )∥ −1 H t Next, we bound T ∑ t=1 ∑T t=1 Bt . T −1 T 1∑ 1∑ 2 ∥wt+1 − zt ∥H + ∥wt − zt ∥2H Bt = t t+1 2 2 ≥ 1 2 1 ≥ 4 ≥ 1 4 t=0 T∑ −1 t=1 T∑ −1 t=1 T∑ −1 ∥wt+1 − zt ∥2H + t 1 2 t=1 T∑ −1 ∥wt − zt ∥2H t=1 T∑ −1 1 ∥wt+1 − wt ∥2H ≥ t 4 t=0 t (5.35) 1 ∥wt+1 − wt ∥22 − ∥w1 − w0 ∥22 4 ∥wt+1 − wt ∥22 , t=0 where the last inequality follows that w0 = w1 = 0. Therefore, 167 T ∑ ft (wt ) − t=1 T ∑ t=1 ∑ 1 ft (z) ≤ (1 + βG2 ) + 2 ∥∇ft (wt ) − ∇ft−1 (wt−1 )∥ −1 Ht 2 T t=1 − 1 4 T∑ −1 (5.36) ∥wt+1 − wt ∥22 t=0 To proceed, we need the following lemma. Lemma 5.17. 
We have T ∑ t=1  T ∑ β 4d  ∥∇ft (wt ) − ∇ft−1 (wt−1 )∥ −1 ≤ ln 1 + ∥∇ft (wt ) − ∇ft−1 (wt−1 )∥22  Ht β 4  t=1 (5.37) Thus,  T ∑ 4d  β ∥∇ft (wt ) − ∇ft−1 (wt−1 )∥ −1 ≤ ∥∇ft (wt ) − ∇ft−1 (wt−1 )∥22  ln 1 + Ht β 4 t=1 t=1   T ∑ 4d  β ≤ ln 1 + ∥∇ft (wt ) − ∇ft (wt−1 ) + ∇ft (wt−1 ) − ∇ft−1 (wt−1 )∥22  β 4 t=1   T T β∑ 4d  β∑ 2 ≤ ln 1 + L ∥wt − wt−1 ∥22 + ∥∇ft (wt−1 ) − ∇ft−1 (wt−1 )∥22  β 2 2 t=1 t=1   T∑ −1 T∑ −1 4d  β β ≤ ln 1 + L2 ∥wt+1 − wt ∥22 + ∥∇ft+1 (wt ) − ∇ft (wt )∥22  β 2 2 t=0 t=0   T∑ −1 4d  β β ≤ ln 1 + L2 ∥wt+1 − wt ∥22 + EGVT,2  β 2 2  T ∑ t=0 (5.38) 168 Then, T ∑ ft (wt ) − t=1 T ∑ t=1  8d  β 1 ft (z) ≤ (1 + βG2 ) + ln 1 + 2 β 2 − T∑ −1  L2 ∥wt+1 − wt ∥22 + t=0 β EGVT,2  2 T −1 1∑ ∥wt+1 − wt ∥22 4 t=0 (5.39) Without loss of generality we assume 8dL2 ≥ 1. Next, let us consider two cases. In the first case, we assume βEGVT,2 ≤ 16dL2 . Then   T∑ −1 8d  β β ln 1 + L2 ∥wt+1 − wt ∥22 + EGVT,2  β 2 2 t=0   T∑ −1 β 8d  L2 ∥wt+1 − wt ∥22 + 8dL2  ln 1 + ≤ β 2 t=0   T∑ −1 8d  β ≤ ln L2 ∥wt+1 − wt ∥22 + 16dL2  β 2 t=0 [ ( ∑ )] β T −1 2 2 L ∥w − w ∥ 8d t t+1 2 +1 = ln 16dL2 + ln 2 t=0 β 16dL2 ∑T −1 β ∑ −1 2 2 L ∥wt+1 − wt ∥22 8d 8d 8d 2 Tt=0 2 2 t=0 ∥wt+1 − wt ∥2 = ≤ ln 16dL + ln 16dL + β β β 4 16dL2 (5.40) where the last inequality follows ln(1 + x) ≤ x for x ≥ 0. Then we get T ∑ t=1 ft (wt ) − T ∑ t=1 8d 1 ln 16dL2 ft (z) ≤ (1 + βG2 ) + 2 β In the second case, we assume βEGVT,2 ≥ 16dL2 ≥ 2, then we have 169 (5.41)   T∑ −1 8d  β β ln 1 + L2 ∥wt+1 − wt ∥22 + EGVT,2  β 2 2 t=0   T∑ −1 8d  β ln ≤ L2 ∥wt+1 − wt ∥22 + βEGVT,2  β 2 t=0 [ ( ∑ )] β T −1 2 2 L ∥w − w ∥ 8d t 2 t+1 ≤ ln(βEGVT,2 ) + ln 2 t=0 +1 β βEGVT,2 β ∑ −1 2 L ∥wt+1 − wt ∥22 8d 8d 2 Tt=0 = ln(βEGVT,2 ) + β β βEGVT,2 ∑ −1 4dL2 Tt=0 ∥wt+1 − wt ∥22 8d = ln(βEGVT,2 ) + β βEGVT,2 ∑T −1 ∥wt+1 − wt ∥22 8d ln(βEGVT,2 ) + t=0 ≤ β 4 (5.42) where the last inequality follows βEGVT,2 ≥ 16dL2 . Then we get T ∑ ft (wt ) − T ∑ t=1 t=1 1 8d ft (z) ≤ (1 + βG2 ) + ln(βEGVT,2 ) 2 β (5.43) Thus, we complete the proof by combining the two cases. Next, we prove Lemma 5.17. We need the following lemma, which can be proved by using Lemma 6 [68] and noting that |I + ∑t ∑T ⊤ 2 d τ =1 uτ uτ | ≤ (1 + t=1 ∥ut ∥2 ) , where | · | denotes the determinant of a matrix. Lemma 5.18. Let u1 , u2 , · · · , uT ∈ Rd be a sequence of vectors. Let Vt = I + ∑t ⊤ τ =1 uτ uτ . Then, T ∑  −1  u⊤ t Vt ut ≤ d ln 1 + T ∑  ∥ut ∥22  (5.44) t=1 t=1 To prove Lemma 5.17, we let vt = ∇ft (wt ), t = 1, . . . , T and v0 = 0. Then Ht = 170 I + βG2 I + β ∑t−1 ⊤ τ =0 vτ vτ . Note that we assume ∥∇ft (w)∥2 ≤ G, therefore t β∑ ⊤ vτ vτ ≥ I + (vτ vτ⊤ + vτ −1 vτ⊤−1 ) Ht ≥ I + β 2 τ =1 τ =1 t β∑ ≥I+ (vτ − vτ −1 )(vτ − vτ −1 )⊤ = Vt 4 τ =1 t ∑ (5.45) √ ∑ Let ut = ( β/2)(vt − vt−1 ), then Vt = I + tτ =1 uτ u⊤ τ . By applying the above lemma, we have   T T ∑ ∑ β β (vt − vt−1 )⊤ Vt−1 (vt − vt−1 ) ≤ d ln 1 + ∥vt − vt−1 ∥22  4 4 t=1 (5.46) t=1 Thus, T ∑ T ∑ (vτ − vτ −1 )⊤ Vt−1 (vτ − vτ −1 ) (vτ − vτ −1 )⊤ H−1 t (vτ − vτ −1 ) ≤ t=1 t=1   T ∑ β 4d  ln 1 + ∥vt − vt−1 ∥22  ≤ β 4 (5.47) t=1 5.6.4 Proof of Theorem 5.13 Let ht (w) = ft (w) + ⟨gt − ∇ft (wt ), w⟩. It is easy seen that ∇ht (wt ) = gt . Followed by Lemma 5.16, we have for any z ∈ (1 − α)W, ) η2 η 1( ⊤ 2 2 ∇ht (wt ) (wt − z) ≤ ∥z − zt−1 ∥2 − ∥z − zt ∥2 + 2 ∥gt − gt−1 ∥22 G 2 G ( ) 1 ∥wt − zt−1 ∥22 + ∥wt − zt ∥22 − 2 Taking summation over t = 1, . . . 
, T , we have, 171 (5.48) T T ∑ ∥z − z0 ∥22 ∑ η 2 η ⊤ ∥gt − gt−1 ∥22 ∇ht (wt ) (wt − z) ≤ + G 2 G2 t=1 − t=1 T ∑ t=1 ) 1( ∥wt − zt−1 ∥22 + ∥wt − zt ∥22 2 ∑1 ∥z − z0 ∥22 ∑ η 2 2− ≤ + ∥g − g ∥ ∥wt − wt−1 ∥22 t t−1 2 2 2 4 G ≤ ≤ T T t=1 t=1 ∑1 1 ∑ η2 2− + ∥g − g ∥ ∥wt − wt−1 ∥22 t t−1 2 2 2 4 G 1 + 2 − T T t=1 T ∑ t=1 T ∑ t=1 T ∑ t=1 2η 2 ∥gt − gt−1 ∥22 + G2 t=1 2η 2 ∥gt − gt−1 ∥22 G2 1 ∥wt − wt−1 ∥22 4 T d 1 2η 2 ∑ ∑ ≤ + 2 2 (ft (wt + δei ) − ft (wt−1 + δei ))ei − (ft (wt ) − ft (wt−1 ))ei 2 δ G t=1 i=1 2 2 T d 2η 2 ∑ ∑ + 2 2 (ft (wt−1 + δei ) − ft−1 (wt−1 + δei ))ei − (ft (wt−1 ) − ft−1 (wt−1 ))ei δ G t=1 i=1 − T ∑ t=1 2 2 1 ∥wt − wt−1 ∥22 , 4 (5.49) where the second inequality follows (5.32). Next, we bound the middle two terms in right hand side of the above inequality. 172 T d ∑ ∑ 2 (ft (wt + δei ) − ft (wt−1 + δei ))ei − (ft (wt ) − ft (wt−1 ))ei t=1 i=1 ≤ ≤ T ∑ t=1 T ∑ 2d 2 d ( ∑ |ft (wt + δei ) − ft (wt−1 + δei )|2 + |ft (wt ) − ft (wt−1 )|2 ) (5.50) i=1 4d2 G2 ∥wt − wt−1 ∥22 , t=1 and d T ∑ ∑ 2 (ft (wt−1 + δei ) − ft−1 (wt−1 + δei ))ei − (ft (wt−1 ) − ft−1 (wt−1 ))ei t=1 i=1 ≤ ≤ T ∑ t=1 T ∑ t=1 2d 2 d ( ∑ |ft (wt−1 + δei ) − ft−1 (wt−1 + δei )|2 + |ft (wt−1 ) − ft−1 (wt−1 )|2 ) (5.51) i=1 4d2 max |ft (w) − ft−1 (w)|2 . w∈W Then we have T T ∑ η 1 8d2 η 2 ∑ ∇ht (wt )⊤ (wt − z) ≤ + ∥wt − wt−1 ∥22 G 2 δ2 t=1 t=1 T T ∑ 8d2 η 2 ∑ 1 2 max |ft (w) − ft−1 (w)| − + 2 2 ∥wt − wt−1 ∥22 4 δ G w∈W t=1 t=1 T 1 8d2 η 2 ∑ max |ft (w) − ft−1 (w)|2 ≤ + 2 2 2 δ G w∈W t=1 (5.52) √ where the last inequality follows that η ≤ δ/(4 2d). Then by using the convexity of ht (w) and dividing both sides by η/G, we have 173 T ∑ ht (wt ) − min ht ((1 − α)w) ≤ w∈W t=1 √ ) (√ 8ηd2 G c ≤ 4d max c + EVAR 2G, EVAR T T 2η δ Gδ 2 (5.53) Following the the proof of Theorem 8 in [5], we have T ∑ ft (wt ) − t=1 ≤ ≤ T ∑ ft (w) ≤ t=1 T ∑ t=1 T ∑ ht (wt ) − ht (wt ) − t=1 T ∑ t=1 T ∑ T ∑ ht (wt ) − t=1 ht (w) + T ∑ ht (w) + t=1 T ∑ T ∑ ft (wt ) − ht (wt ) − ft (w) + ht (w) t=1 ⟨gt − ∇ft (wt ), w − wt ⟩ t=1 ht (w) + √ dLδT t=1 (5.54) where the last inequality follows from the following facts: √ dLδ ∥gt − ∇ft (wt )]∥2 ≤ 2 (5.55) ∥w − wt ∥ ≤ 2 Then we have T ∑ t=1 ft (wt ) − min w∈W T ∑ t=1 √ (√ ) √ 4d c ft ((1 − α)w) ≤ max 2G, EVART + dLδT δ (5.56) By the Lipschitz continuity of ft (w), we have T ∑ t=1 ft ((1 − α)w) ≤ T ∑ t=1 174 ft (w) + GαT (5.57) The we get T ∑ ft (wt ) − min w∈W t=1 T ∑ t=1 ft (w) ≤ √ ) (√ √ 4d max 2G, EVARcT + δ dLT + αGT δ (5.58) Plugging the stated values of δ and α completes the proof. 5.7 Summary In this chapter, we proposed two novel algorithms for online convex optimization that bound the regret by the gradual variation of consecutive cost functions. The first algorithm is an improvement of the FTRL algorithm, and the second algorithm is based on the mirror prox method. Both algorithms maintain two sequence of solution points, a sequence of decision points and a sequence of searching points, and share the same order of regret bound up to a constant. The online mirror prox method only requires to keep tracking of a single gradient of each cost function, while the improved FTRL algorithm needs to evaluate the gradient of each cost function at two points and maintain a sum of up-to-date gradients of the cost functions. We note that a very recent work Chiang et al. 
[40] extends the prox method to a two-point bandit setting and achieves regret bounds in expectation similar to those in the full-information setting, i.e., O(d² √(EGVT,2 ln T)) for smooth functions and O(d² ln(EGVT,2 + ln T)) for smooth and strongly convex cost functions, where EGVT,2 is the gradual variation defined on the gradients of the cost functions. We would like to make a thought-provoking comparison between our regret bound and theirs for online bandit convex optimization with smooth cost functions. First, the gradual variation in our bandit setting is defined on the values of the cost functions, in contrast to theirs, which is defined on the gradients of the cost functions. Second, we query the cost function at d + 1 points, in contrast to 2 points in their algorithms, and as a tradeoff our regret bound has a better dependence on the dimension (i.e., O(d)) than theirs (i.e., O(d²)). Third, our regret bound carries an extra factor of √T, compared with the ln T factor in theirs. Therefore, some open problems are how to achieve a dependence on d lower than d² in the two-point bandit setting, and how to remove the √T factor while keeping a small dependence on d in our multi-point bandit setting; studying the two different types of gradual variation in bandit settings is also left as future work.

5.8 Bibliographic Notes

A wide range of literature deals with the online decision making problem, and a number of regret-minimizing algorithms achieve the optimal regret bound. The first distribution-free framework for sequential decision making was proposed by Hannan [65] and was rediscovered in [86]. Blackwell, in his seminal paper [23], generalized Hannan's result to the problem of playing a repeated game with a vector-valued payoff function and gave a precise necessary and sufficient condition for when a set is approachable. The most well-known and successful work is probably the Hedge algorithm [58], a direct generalization of Littlestone and Warmuth's Weighted Majority (WM) algorithm [99]. Another algorithm for the online decision making problem is Vovk's aggregating strategies [147]. Other recent studies include improved theoretical bounds and the parameter-free hedging algorithm [39] and adaptive Hedge [53] for decision-theoretic online learning. We refer readers to [38] for an in-depth discussion of this subject.

As discussed in Chapter 2, over the past decade many algorithms have been proposed for online convex optimization, especially for online linear optimization. In the first seminal paper on online convex optimization, Zinkevich [158] proposed a gradient descent algorithm with a regret bound of O(√T). When the cost functions are strongly convex, the regret bound of online gradient descent is reduced to O(log T) with an appropriately chosen step size [68]. Another common methodology for online convex optimization, especially for online linear optimization, is based on the framework of Follow the Leader (FTL), which chooses wt by minimizing the cumulative cost incurred over all previous trials. Since the naive FTL algorithm fails to achieve sublinear regret in the worst case, many variants have been developed to fix this problem, including Follow The Perturbed Leader (FTPL) [86], Follow The Regularized Leader (FTRL) [3], and Follow The Approximate Leader (FTAL) [68].
Other methodologies for online convex optimization introduce a potential function (or link function) to map solutions between the space of primal variables and the space of dual variables, and carry out primal-dual updates based on the potential function. The well-known Exponentiated Gradient (EG) algorithm [89] and the Multiplicative Weights algorithm [99, 58] belong to this category. We note that these different algorithms are closely related; for example, in online linear optimization, the potential-based primal-dual algorithm is equivalent to the FTRL algorithm.

Chapter 6 Gradual Variation for Composite Losses

This chapter continues our investigation of online learning methods that lead to better regret bounds in gradually evolving environments. The results obtained in Chapter 5 rely on the assumption that the cost functions are smooth. We also showed that for general non-smooth functions, when the learner is only provided with first order information about the cost functions, it is impossible to obtain a regret bounded by the gradual variation. In this chapter, however, we show that a gradual variation bound is achievable for a special class of non-smooth functions that are composed of a smooth component and a non-smooth component.

We consider two categories for the non-smooth component. In the first category, we assume that the non-smooth component is a fixed function that is simple enough for the composite gradient mapping to be solved with little computational overhead compared to the plain gradient mapping. A common example in this category is a non-smooth regularizer. For instance, in addition to the basic domain W, one may wish to enforce a sparsity constraint on the decision vector w, i.e., ∥w∥0 ≤ k < d, which is important in feature selection. Since the sparsity constraint ∥w∥0 ≤ k is non-convex, it is usually implemented by adding an ℓ1 regularizer λ∥w∥1 to the objective function, where λ > 0 is a regularization parameter. Therefore, at each iteration the cost function is given by ft(w) + λ∥w∥1. To prove a regret bound by gradual variation for this type of non-smooth optimization, we first present a simplified version of the general online mirror prox method from Chapter 5 and show that it attains exactly the same regret bound as stated in Chapter 5, and we then extend the algorithm to non-smooth optimization with a fixed non-smooth component.

In the second category, we assume that the non-smooth component can be written with an explicit maximization structure. In general, we consider a time-varying non-smooth component, present a primal-dual prox method, and prove a min-max regret bound by gradual variation. When the non-smooth components are equal across all trials, the usual regret is bounded by the min-max bound plus a variation in the non-smooth component. As an application of the min-max regret bound, we consider the problem of online classification with the hinge loss and show that the number of mistakes can be bounded by a variation in the sequential examples.

Before moving to the detailed analysis, it is worth mentioning that several works have proposed algorithms for optimizing the two types of non-smooth functions described above with an optimal convergence rate of O(1/T) [122, 123]. Therefore, the existence of a regret bound by gradual variation for these two types of non-smooth optimization does not violate the impossibility argument of Section 5.3.
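Before presenting the algorithms, the following sketch illustrates why the first category is computationally benign: for the ℓ1 regularizer g(w) = λ∥w∥1 mentioned above, the composite gradient mapping has a closed-form soft-thresholding solution. This is a standard fact rather than something stated in the text; dropping the domain constraint and the helper names are our simplifications for illustration.

```python
import numpy as np

def soft_threshold(v, tau):
    """Closed-form solution of min_w 0.5*||w - v||_2^2 + tau*||w||_1 (coordinate-wise shrinkage)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def composite_prox_step(z, grad, eta, L, lam):
    """One composite gradient mapping
         argmin_w  <w, grad> + (L / (2*eta)) * ||w - z||_2^2 + lam * ||w||_1,
    with the domain constraint dropped for illustration; completing the square shows it
    reduces to soft-thresholding of the plain gradient step z - (eta/L)*grad."""
    return soft_threshold(z - (eta / L) * grad, lam * eta / L)
```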
6.1 Composite Losses with a Fixed Non-smooth Component

6.1.1 A Simplified Online Mirror Prox Algorithm

In this subsection, we present a simplified version of the online mirror prox (OMP) method proposed in Chapter 5, which serves as the foundation for developing the algorithm for non-smooth optimization. The key trick is to replace the domain constraint w ∈ W with a non-smooth function in the objective. Let δW(w) denote the indicator function of the domain W, i.e., δW(w) = 0 if w ∈ W and δW(w) = ∞ otherwise. Then the proximal gradient mapping for updating w (step 4) in Algorithm 5 is equivalent to

wt = arg min_w ⟨w, ∇ft−1(wt−1)⟩ + (L/η) B(w, zt−1) + δW(w).

By the first order optimality condition, there exists a sub-gradient vt ∈ ∂δW(wt) such that

∇ft−1(wt−1) + (L/η)(∇Φ(wt) − ∇Φ(zt−1)) + vt = 0. (6.1)

Thus, wt is equal to

wt = arg min_w ⟨w, ∇ft−1(wt−1) + vt⟩ + (L/η) B(w, zt−1). (6.2)

Then we can change the update for zt to

zt = arg min_w ⟨w, ∇ft(wt) + vt⟩ + (L/η) B(w, zt−1). (6.3)

The key ingredient of the above update, compared to step 6 in Algorithm 5, is that we explicitly use the sub-gradient vt satisfying the optimality condition for wt instead of solving a domain-constrained optimization problem. The advantage of updating zt by (6.3) is that we can easily compute zt from the first order optimality condition, i.e.,

∇ft(wt) + vt + (L/η)(∇Φ(zt) − ∇Φ(zt−1)) = 0. (6.4)

Note that Eq. (6.1) gives vt = −∇ft−1(wt−1) − (L/η)(∇Φ(wt) − ∇Φ(zt−1)). By plugging this into (6.4), we arrive at the following simplified update for zt:

∇Φ(zt) = ∇Φ(wt) + (η/L)(∇ft−1(wt−1) − ∇ft(wt)).

The simplified version of Algorithm 5 is presented in Algorithm 7.

Remark 6.1. We make three remarks for Algorithm 7. First, the searching point zt does not necessarily belong to the domain W, which is usually not a problem given that the decision point wt is always in W. Nevertheless, the update can be followed by a projection step zt = arg min_{w∈W} B(w, z′t) to ensure that the searching point also stays in the domain W, where we slightly abuse the notation z′t for the point satisfying ∇Φ(z′t) = ∇Φ(wt) + (η/L)(∇ft−1(wt−1) − ∇ft(wt)).
As a result, we can apply the same analysis as in the proof of Theorem 5.5 to obtain the same regret bound in Theorem 5.6 for Algorithm 7. Note that the above inequality remains valid even if we take a projection step after the update for z′t due to the generalized pythagorean inequality B(w, zt ) ≤ B(w, z′t ), ∀w ∈ W [38]. 6.1.2 A Gradual Variation Bound for Online Non-Smooth Optimization In spirit of Algorithm 7, we present an algorithm for online non-smooth optimization of functions ft (w) = ft (w) + g(w) with a regret bound by gradual variation as: EGVT = T∑ −1 ∥∇ft+1 (wt ) − ∇ft (wt )∥2∗ . t=0 The trick is to solve the composite gradient mapping: wt = arg min ⟨w, ∇ft−1 (wt−1 )⟩ + w∈W L B(w, zt−1 ) + g(w) η and update zt by ∇Φ(zt ) = ∇Φ(wt ) + η (∇ft−1 (wt−1 ) − ∇ft (wt )). L 183 Algorithm 8 Online Mirror Prox Method with a Fixed Non-Smooth Component 1: Input: η > 0, Φ(z) 2: Initialization: z0 = w0 = minz∈W Φ(z) and f0 (w) = 0 3: for t = 1, . . . , T do 4: Predict wt by { } L wt = arg min ⟨w, ∇ft−1 (wt−1 )⟩ + B(w, zt−1 ) + g(w) η w∈W Receive cost function ft (·) and incur loss ft (wt ) Update zt by ( ) η ∗ ′ zt = ∇Φ ∇Φ(wt ) + (∇ft−1 (wt−1 ) − ∇ft (wt )) L 5: 6: and zt = minw∈W B(w, z′t ) 7: end for Algorithm 8 shows the detailed steps and Corollay 6.2 states the regret bound, which can be proved similarly. Corollary 6.2. Let ft (w) = ft (w) + g(w), t = 1, . . . , T be a sequence of convex functions where ft (w) are L-smooth continuous w.r.t ∥ · ∥ and g(w) is a non-smooth function, Φ(z) be a α-strongly convex function w.r.t ∥ · ∥, and EGVT be defined in (5.13). By setting {√ √ } √ η = (1/2) min α/ 2, LR/ EGVT , we have the following regret bound T ∑ t=1 6.2 ft (wt ) − min w∈W T ∑ ft (w) ≤ 2R max (√ ) √ √ 2LR/ α, EGVT . t=1 Composite Losses with an Explicit Max Structure In previous subsection, we assume the composite gradient mapping with the non-smooth component can be efficiently solved. Here, we replace this assumption with an explicit max structure of the non-smooth component. In what follows, we present a primal-dual prox method for such non-smooth cost functions 184 and prove its regret bound. We consider a general setting, where the non-smooth functions ft (w) has the following structure: ft (w) = ft (w) + max⟨At w, u⟩ − ϕt (u), (6.5) u∈Q where ft (w) and ϕt (u) are L1 -smooth and L2 -smooth functions, respectively, and At ∈ Rm×d is a matrix used to characterize the non-smooth component of ft (w) with −ϕt (u) by maximization. Similarly, we define a dual cost function ϕt (u) as ϕt (u) = −ϕt (u) + min ⟨At w, u⟩ + ft (w). (6.6) x∈W We refer to w as the primal variable and to u as the dual variable. To motivate the setup, let us consider online classification with hinge loss ℓt (w) = max(0, 1 − yt ⟨w, xt ⟩), where we slightly abuse the notation (xt , yt ) to denote the attribute and label pair received at trial t. It is straightforward to see that ℓt (w) is a non-smooth function and can be cast into the form in (6.5) by ⊤ ℓt (w) = max α(1 − yt x⊤ t w) = max −αyt xt w + α. α∈[0,1] α∈[0,1] To present the algorithm and analyze its regret bound, we introduce some notations. Let Ft (w, u) = ft (w) + ⟨At w, u⟩ − ϕt (u), Φ1 (w) be a α1 -strongly convex function defined on the primal variable w w.r.t a norm ∥ · ∥p and Φ2 (u) be a α2 -strongly convex function defined on the dual variable u w.r.t a norm ∥ · ∥q . Correspondingly, let B1 (w, z) and B2 (u, v) denote the induced Bregman distance, respectively. 
We assume the domains W, Q are bounded 185 and matrices At have a bounded norm, i.e., max ∥w∥p ≤ R1 , w∈W max ∥u∥q ≤ R2 u∈Q max Φ1 (w) − min Φ1 (w) ≤ M1 w∈W w∈W (6.7) max Φ2 (u) − min Φ2 (u) ≤ M2 u∈Q ∥At ∥p,q = u∈Q max ∥w∥p ≤1,∥u∥q ≤1 u⊤ At w ≤ σ. Let ∥ · ∥p,∗ and ∥ · ∥q,∗ denote the dual norms to ∥ · ∥p and ∥ · ∥q , respectively. To prove a variational regret bound, we define a gradual variation as follows: EGVT,p,q = + T∑ −1 t=0 T∑ −1 ∥∇ft+1 (wt ) − ∇ft (wt )∥2p,∗ + (R12 + R22 ) T∑ −1 ∥At − At−1 ∥2p,q t=0 ∥∇ϕt+1 (ut ) − ∇ϕt (ut )∥2q,∗ . t=0 Given above notations, Algorithm 9 shows the detailed steps and Theorem 6.3 states a min-max bound. Theorem 6.3. Let ft (w) = ft (w) + maxu∈Q ⟨At w, u⟩ − ϕt (u), t = 1, . . . , T be a sequence of non-smooth functions. Assume ft (w), ϕ(u) are L = max(L1 , L2 )-smooth functions and the domain W, Q and At satisfy the boundness condition as in (6.7). Let Φ(w) be a α1 -strongly convex function w.r.t the norm ∥·∥p , Φ(u) be a α2 -strongly convex function w.r.t. the norm ∥· ) (√ √ √ √ M1 + M2 /(2 EGVT,p,q ), α/(4 σ 2 + L2 ) ∥q , and α = min(α1 , α2 ). By setting η = min in Algorithm 9, then we have 186 Algorithm 9 Online Mirror Prox Method with an Explicit Max Structure 1: Input: η > 0, Φ1 (z), Φ2 (v) 2: Initialization: z0 = w0 = minz∈W Φ1 (z), v0 = u0 = minv∈Q Φ2 (v) and f0 (w) = ϕ0 (u) = 0 3: for t = 1, . . . , T do 4: Update ut by { } L2 ut = arg max ⟨u, At−1 wt−1 − ∇ϕt−1 (ut−1 )⟩ − B (u, vt−1 ) η 2 u∈Q 5: Predict wt by { wt = arg min w∈W } L1 ⟨w, ∇ft−1 (wt−1 ) + At−1 A⊤ t−1 ut−1 ⟩ + η B1 (w, zt−1 ) 6: 7: Receive cost function ft (·) and incur loss ft (wt ) Update vt by } { L2 vt = arg max ⟨u, At wt − ∇ϕt (ut )⟩ − B (u, vt−1 ) η 2 u∈Q 8: Update zt by { zt = arg min w∈W L1 ⟨w, ∇ft (wt ) + A⊤ B1 (w, zt−1 ) t ut ⟩ + } η 9: end for max u∈Q T ∑ Ft (wt , u) − min t=1 w∈W T ∑ Ft (w, ut ) t=1 ( √ ) √ (M1 + M2 )(σ 2 + L2 ) √ ≤ 4 M1 + M2 max 2 , EGVT,p,q . α To facilitate understanding, we break the proof into several lemmas. The following lemma is by analogy with Lemma 5.16. √ (w) denote a single vector with a norm ∥θ∥ = ∥w∥2p + ∥u∥2q and Lemma 6.4. Let θ = √ u a dual norm ∥θ∥∗ = ∥w∥2p∗ + ∥u∥2q∗ . Let Φ(θ) = Φ1 (w) + Φ2 (u), B(θ, ζ) = B1 (w, u) + B2 (z, v). Then 187 ( ) ( ) ∇w Ft (wt , ut ) ⊤ wt − w η ≤ B(θ, ζt−1 ) − B(θ, ζt ) −∇u Ft (wt , ut ) ut − u ) ( + η 2 ∥∇w Ft (wt , ut ) − ∇w Ft−1 (wt−1 , ut−1 )∥2p,∗ ( ) + η 2 ∥∇u Ft (wt , ut ) − ∇u Ft−1 (wt−1 , ut−1 )∥2q,∗ ) α( 2 2 2 2 − ∥wt − zt ∥p + ∥ut − vt ∥q + ∥wt − zt−1 ∥p + ∥ut − vt−1 ∥q . 2 Proof. The updates of (wt , ut ) in Algorithm 9 can be seen as applying the updates in ( ) ( ) ( ) wt zt zt−1 Lemma 5.16 with θt = in place of w, ζt = in place of z+ , ζt−1 = in ut vt vt−1 place of z. Note that Φ(θ) is a α = min(α1 , α2 )-strongly convex function with respect to the norm ∥θ∥. Then applying the results in Lemma 5.16 we can complete the proof. 
Applying the convexity of Ft (w, u) in terms of w and the concavity of Ft (w, u) in terms of u to the result in Lemma 6.4, we have η (Ft (wt , u) − Ft (w, ut )) ≤ B1 (w, zt−1 ) − B1 (w, zt ) + B2 (u, vt−1 ) − B2 (u, vt ) ⊤ 2 + η 2 ∥∇ft (wt ) − ∇ft−1 (wt−1 ) + A⊤ t ut − At−1 ut−1 ∥p,∗ + η 2 ∥∇ϕt (ut ) − ∇ϕt−1 (ut−1 ) + At wt − At−1 wt−1 ∥2q,∗ ) α( ) α( − ∥wt − zt−1 ∥2p + ∥wt − zt ∥2p − ∥ut − vt−1 ∥2q + ∥ut − vt ∥2q 2 2 ≤ B1 (w, zt−1 ) − B1 (w, zt ) + B2 (u, vt−1 ) − B2 (u, vt ) ⊤ 2 + 2η 2 ∥∇ft (wt ) − ∇ft−1 (wt−1 )∥2p,∗ + 2η 2 ∥A⊤ t ut − At−1 ut−1 ∥p,∗ + 2η 2 ∥∇ϕt (ut ) − ∇ϕt−1 (ut−1 )∥2q,∗ + 2η 2 ∥At wt − At−1 wt−1 ∥2q,∗ ) α( ) α( − ∥wt − zt−1 ∥2p + ∥wt − zt ∥2p − ∥ut − vt−1 ∥2q + ∥ut − vt ∥2q 2 2 188 (6.8) The following lemma provides tools for proceeding the bound. Lemma 6.5. ∥∇ft (wt ) − ∇ft−1 (wt−1 )∥2p,∗ ≤ 2∥∇ft (wt ) − ∇ft−1 (wt )∥2p,∗ + 2L2 ∥wt − wt−1 ∥2p ∥∇ϕt (ut ) − ∇ϕt−1 (ut−1 )∥2q,∗ ≤ 2∥∇ϕt (ut ) − ∇ϕt−1 (ut )∥2q,∗ + 2L2 ∥ut − ut−1 ∥2q ∥At wt − At−1 wt−1 ∥2q,∗ ≤ 2R12 ∥At − At−1 ∥2p,q + 2σ 2 ∥wt − wt−1 ∥2p ⊤ 2 2 2 2 2 ∥A⊤ t ut − At−1 ut−1 ∥p,∗ ≤ 2R2 ∥At − At−1 ∥p,q + 2σ ∥ut − ut−1 ∥q Proof. We prove the first and the third inequalities. Another two inequalities can be proved similarly. ∥∇ft (wt ) − ∇ft−1 (wt−1 )∥2p,∗ ≤ 2∥∇ft (wt ) − ∇ft−1 (wt )∥2p,∗ + 2∥∇ft−1 (wt ) − ∇ft−1 (wt−1 )∥2p,∗ ≤ 2∥∇ft (wt ) − ∇ft−1 (wt )∥2p,∗ + 2L2 ∥wt − wt−1 ∥2p where we use the smoothness of ft (w). ∥At wt − At−1 wt−1 ∥2q,∗ ≤ 2∥(At − At−1 )wt ∥2q,∗ + 2∥At−1 (wt − wt−1 )∥2p ≤ 2R12 ∥At − At−1 ∥2p,q + 2σ 2 ∥wt − wt−1 ∥2p 189 Lemma 6.6. For any w ∈ W and u ∈ Q, we have η (Ft (wt , u) − Ft (w, ut )) ≤ B1 (w, zt−1 ) − B1 (w, zt ) + B2 (u, vt−1 ) − B2 (u, vt ) ( 2 + 4η ∥∇ft (wt ) − ∇ft−1 (wt )∥2p,∗ + ∥∇ϕt (ut ) − ∇ϕt−1 (ut )∥2q,∗ ) 2 2 2 + (R1 + R2 )∥At − At−1 ∥p,q + 4η 2 σ 2 ∥wt − wt−1 ∥2p + 4η 2 L2 ∥wt − wt−1 ∥2p − + 4η 2 σ 2 ∥ut − ut−1 ∥2p + 4η 2 L2 ∥ut − ut−1 ∥2q − α (∥wt − zt ∥2p + ∥wt − zt−1 ∥2p ) 2 α (∥ut − vt ∥2q + ∥ut − vt−1 ∥2q ). 2 Proof. The lemma can be proved by combining the results in Lemma 6.5 and the inequality in (6.8). η (Ft (wt , u) − Ft (w, ut )) ≤ B1 (w, zt−1 ) − B1 (w, zt ) + B2 (u, vt−1 ) − B2 (u, vt ) ⊤ 2 + 2η 2 ∥∇ft (wt ) − ∇ft−1 (wt−1 )∥2p,∗ + 2η 2 ∥A⊤ t ut − At−1 ut−1 ∥p,∗ + 2η 2 ∥∇ϕt (ut ) − ∇ϕt−1 (ut−1 )∥2q,∗ + 2η 2 ∥At wt − At−1 wt−1 ∥2q,∗ ) α( ) α( − ∥wt − zt−1 ∥2p + ∥wt − zt ∥2p − ∥ut − vt−1 ∥2q + ∥ut − vt ∥2q 2 2 ≤ B1 (w, zt−1 ) − B1 (w, zt ) + B2 (u, vt−1 ) − B2 (u, vt ) + 4η 2 ∥∇ft (wt ) − ∇ft−1 (wt )∥2p,∗ + 4η 2 L2 ∥wt − wt−1 ∥2p + 4η 2 σ 2 ∥wt − wt−1 ∥2p + 4η 2 R12 ∥At − At−1 ∥2p,q + 4η 2 R22 ∥At − At−1 ∥2p,q + 4η 2 ∥∇ϕt (ut ) − ∇ϕt−1 (ut )∥2q,∗ + 4η 2 L2 ∥ut − ut−1 ∥2q + 4η 2 σ 2 ∥ut − ut−1 ∥2q ) α( ) α( − ∥wt − zt−1 ∥2p + ∥wt − zt ∥2p − ∥ut − vt−1 ∥2q + ∥ut − vt ∥2q 2 2 190 ≤ B1 (w, zt−1 ) − B1 (w, zt ) + B2 (u, vt−1 ) − B2 (u, vt ) ( 2 + 4η ∥∇ft (wt ) − ∇ft−1 (wt )∥2p,∗ + ∥∇ϕt (ut ) − ∇ϕt−1 (ut )∥2q,∗ ) 2 2 2 + (R1 + R2 )∥At − At−1 ∥p,q + 4η 2 σ 2 ∥wt − wt−1 ∥2p + 4η 2 L2 ∥wt − wt−1 ∥2p − + 4η 2 σ 2 ∥ut − ut−1 ∥2p + 4η 2 L2 ∥ut − ut−1 ∥2q − α (∥wt − zt ∥2p + ∥wt − zt−1 ∥2p ) 2 α (∥ut − vt ∥2q + ∥ut − vt−1 ∥2q ). 2 Proof of Theorem 6.3. Taking summation of the inequalities in Lemma 6.6 over t = 1, . . . , T , √ √ applying the inequality in (5.32) twice and using η ≤ α/(4 σ 2 + L2 ), we have T ∑ t=1 Ft (wt , u) − T ∑ Ft (w, ut ) ≤ 4ηEGVT,p,q + t=1 M1 + M2 η ( √ ) √ (M1 + M2 )(σ 2 + L2 ) √ = 4 M1 + M2 max 2 , EGVT,p,q . α We complete the proof by using w∗ = arg min w∈W T ∑ Ft (wt , u) t=1 and u∗ = arg max u∈Q T ∑ Ft (w, ut ) t=1 . 
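Before turning to the corollaries of Theorem 6.3, the following sketch spells out one round of Algorithm 9 in the Euclidean case, where B1 and B2 are squared Euclidean distances and each arg-min/arg-max step reduces to a projected gradient step. The projection routines project_W and project_Q and the function names are placeholders and not part of the algorithm statement.

import numpy as np

def omp_max_structure_round(z_prev, v_prev, w_prev, u_prev,
                            grad_f_prev, A_prev, grad_phi_prev,   # data from round t-1
                            grad_f, A, grad_phi,                  # data revealed in round t
                            eta, L1, L2, project_W, project_Q):
    # Step 4: dual decision based on the previous round
    u = project_Q(v_prev + (eta / L2) * (A_prev @ w_prev - grad_phi_prev(u_prev)))
    # Step 5: primal decision (the prediction w_t)
    w = project_W(z_prev - (eta / L1) * (grad_f_prev(w_prev) + A_prev.T @ u_prev))
    # --- the cost function (f_t, A_t, phi_t) is revealed at this point ---
    # Step 7: dual searching point
    v = project_Q(v_prev + (eta / L2) * (A @ w - grad_phi(u)))
    # Step 8: primal searching point
    z = project_W(z_prev - (eta / L1) * (grad_f(w) + A.T @ u))
    return w, u, z, v

In an online implementation the last two updates would of course be executed only after the cost function of round t is received, matching steps 6–8 of Algorithm 9.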
As an immediate result of Theorem 6.3, the following Corollary states a regret bound for 191 non-smooth optimization with a fixed non-smooth component that can be written as a max structure, i.e., ft (w) = ft (w) + [g(w) = maxu∈Q ⟨Aw, u⟩ − ϕ(u)]. Corollary 6.7. Let ft (w) = ft (w) + g(w), t = 1, . . . , T be a sequence of non-smooth functions, where g(w) = maxu∈Q ⟨Aw, u⟩ − ϕ(u), and the gradual variation EGVT be defined in (5.13) w.r.t the dual norm ∥ · ∥p,∗ . Assume ft (w) are L-smooth functions w.r.t ∥ · ∥, the domain W, Q and A satisfy the boundness condition as in (6.7). If we set (√ ) √ √ √ η = min M1 + M2 /(2 EGVT ), α/(4 σ 2 + L2 ) in Algorithm 9, then we have the following regret bound T ∑ ft (wt ) − min w∈W t=1 T ∑ ft (w) t=1 ( √ ) √ (M1 + M2 )(σ 2 + L2 ) √ ≤ 4 M1 + M2 max 2 , EGVT + V(g, w1:T ), α where wT = ∑T t=1 wt /T and V(g, w1:T ) = ∑T t=1 |g(wt ) − g(wT )| measures the variation in the non-smooth component. Proof. In the case of fixed non-smooth component, the gradual variation defined in (6.8) reduces the one defined in (5.13) w.r.t the dual norm ∥ · ∥p,∗ . By using the bound in (6.9) and noting that ft (w) = maxu∈Q Ft (w, u) ≥ Ft (w, ut ), we have ) ∑ T ( T ∑ ft (wt ) + ⟨Awt , u⟩ − ϕ(u) ≤ ft (w)+ t=1 t=1 ( √ ) √ (M1 + M2 )(σ 2 + L2 ) √ , EGVT . 4 M1 + M2 max 2 α 192 Therefore ) ∑ T ( T ∑ ft (wt ) + g(wT ) ≤ ft (w)+ t=1 t=1 ( √ ) √ (M1 + M2 )(σ 2 + L2 ) √ , EGVT . 4 M1 + M2 max 2 α We complete the proof by complementing ft (wt ) with g(wt ) to obtain ft (wt ) and moving ∑ the additional term Tt=1 (g(wT ) − g(wt )) to the right hand side . Remark 6.8. Note that the regret bound in Corollary 6.7 has an additional term V (g, w1:T ) compared to the regret bound in Corollary 6.2, which constitutes a tradeoff between the reduced computational cost in solving a composite gradient mapping. To see an application of Theorem 6.3 to an online non-smooth optimization with timevarying non-smooth components, let us consider the example of online classification with hinge loss. At each trial, upon receiving an example xt , we need to make a prediction based on the current model wt , i.e., yt = ⟨wt , xt ⟩, then we receive the true label of xt denoted by yt ∈ {+1, −1}. The goal is to minimize the total number of mistakes across the time line MT = ∑T t=1 I[yt yt ≤ 0]. Here we are interested in a scenario that the data sequence (xt , yt ), t = 1, . . . , T has a small gradual variation in terms of yt xt . To obtain such a gradual variation based mistake bound, we can apply Algorithm 9. For the purpose of deriving the mistake bound, we need to make a small change to Algorithm 9. At the beginning of each trial, we first make a prediction yt = ⟨wt , xt ⟩, and if we make a mistake I[yt yt ≤ 0] the we proceed to update the auxiliary primal-dual pair (wt′ , βt ) similar to (zt , vt ) in Algorithm 9 and the primal-dual pair (wt+1 , αt+1 ) similar to (wt+1 , ut+1 ) in Algorithm 9, which are 193 given explicitly as follows: βt = ∏( ) βt−1 + η(1 − wt⊤ yt xt ) , wt′ = αt+1 = ′ (wt−1 + ηαt yt xt ) ∥w∥2 ≤R [0,1] ∏( ∏ ) ⊤ βt + η(1 − wt yt xt ) , wt+1 = ∏ (wt′ + ηαt yt xt ). ∥w∥2 ≤R [0,1] Without loss of generality, we let (xt , yt ), t = 1, . . . , MT denote the examples that are predicted incorrectly. The function Ft (·, ·) is written as Ft (w, α) = α(1 − yt ⟨w, xt ⟩). Then for a total sequence of T examples, we have the following bound by assuming ∥wt ∥2 ≤ 1 and √ η ≤ 1/2 2 MT ∑ Ft (wt , α) ≤ t=1 MT ∑ ℓ(yt wt⊤ xt ) + η MT −1 ∑ t=1 (R2 + 1)∥yt+1 xt+1 − yt xt ∥22 + t=0 R 2 + α2 . 
2η Since yt wt⊤ xt is less than 0 for the incorrectly predicted examples, if we set α = 1 in the above inequality, we have MT ≤ ≤ MT ∑ t=1 MT ∑ ℓ(yt wt⊤ xt ) + η MT −1 ∑ (R2 + 1)∥yt+1 xt+1 − yt xt ∥22 + t=0 ℓ(yt wt⊤ xt ) + R2 + 1 2η √ √ 2(R2 + 1) max(2, EGVT,2 ). t=1 which results in a gradual variational mistake bound, where √ EVGT,2 measures the gradual variation in the incorrectly predicted examples. To end the discussion, we note that one may find applications of a small gradual variation of yt xt in time series classification. For instance, if xt represent some medical measurements of a person and yt indicates whether the person observes a disease, since the health conditions usually change slowly then it is expected that 194 the gradual variation of yt xt is small. Similarly, if xt are some sensor measurements of an equipment and yt indicates whether the equipment fails or not, we would also observe a small gradual variation of the sequence yt xt during a time period. 6.3 Summary In this chapter we developed a simplified online mirror prox method using a composite gradient mapping for non-smooth optimization with a fixed non-smooth component and a primal-dual prox method for non-smooth optimization with the non-smooth component written as a max structure. Despite the impossibility result in Chapter 5 which demonstrated that smoothness of loss functions in necessary to obtain gradual variation bounds, we showed that a simplified version of online mirror prix method is able to attain regret bounded by gradual variation for loss functions with a smooth component and two types of mentioned non-smooth components. 195 Chapter 7 Mixed Optimization for Smooth Losses In this part of the thesis, we consider stochastic convex optimization problem and show that leveraging the smoothness of functions allows us to devise stochastic optimization algorithms that enjoy faster convergence rate. The focus of this chapter is on stochastic smooth optimization. The motivation for exploiting smoothness in stochastic optimization stems from the observation that the optimal √ convergence rate for stochastic optimization of smooth functions is O(1/ T ), which is same as stochastic optimization of Lipschitz continuous convex functions. This is in contrast to optimizing smooth functions using full gradients, which yields a convergence rate of O(1/T 2 ). Therefore, it is of great interest to exploit smoothness in stochastic setting as well. In particular, we are interested in designing an efficient algorithm that is in the same spirit of the stochastic gradient descent method, but can effectively leverage the smoothness of the loss function to achieve a significantly faster convergence rate. We introduce a new setup for optimizing convex functions, termed as mixed optimization, which allows to access both a stochastic oracle and a full gradient oracle to take advantages of their individual merits. Our goal is to significantly improve the convergence rate of stochastic optimization of smooth functions by having an additional small number of 196 accesses to the full gradient oracle. We show that, with an O(ln T ) calls to the full gradient oracle and an O(T ) calls to the stochastic oracle, the proposed mixed optimization algorithm is able to achieve an optimization error of O(1/T ). 
The key insight underlying the mixed optimization paradigm is that by infrequent use of full gradients at specific points we are able to progressively reduce the variance of stochastic gradients which leads to faster convergence rates. The rest of this chapter is organized as follows. In Section 7.1 we motivate the problem. Section 7.2 describes the MixedGrad algorithm, discusses the main intuition behind it, and states the main result on its convergence rate. The proof of convergence rate is given in Section 7.3 and the omitted proofs are deferred to Section 7.4. Section 7.5 concludes the paper and discusses few open questions. Finally, Section 7.6 briefly reviews the literature on deterministic and stochastic optimization. 7.1 Motivation As it has been shown in Chapter 2, many practical machine learning algorithms follow the framework of empirical risk minimization, which often can be cast into the following generic optimization problem: 1∑ min F(w) := fi (w), n w∈W n (7.1) i=1 where n is the number of training examples, fi (w) encodes the loss function related to the ith training example (xi , yi ), and W is a bounded convex domain that is introduced to regularize the solution w ∈ W (i.e., the smaller the size of W, the stronger the regularization 197 is). In this chapter, we focus on the learning problems for which the loss function fi (w) is smooth. Examples of smooth loss functions include least square with fi (w) = (yi − ⟨w, xi ⟩)2 and logistic regression with fi (w) = log (1 + exp(−yi ⟨w, xi ⟩)). Since the regularization is enforced through the restricted domain W, we did not introduce a ℓ2 regularizer λ∥w∥2 /2 into the optimization problem and as a result, we do not assume the loss function to be strongly convex. We note that a small ℓ2 regularizer does NOT improve the convergence rate of stochastic optimization. More specifically, the convergence rate for stochastically √ √ optimizing a ℓ2 regularized loss function remains as O(1/ T ) when λ = O(1/ T ) [72], a scenario that is often encountered in real-world applications. A preliminary approach for solving the optimization problem in (7.1) is the batch gradient descent (GD) algorithm [121]. It starts with some initial point, and iteratively updates the solution using the equation wt+1 = ΠW (wt − η∇F (wt )), where ΠW (·) is the orthogonal projection onto the convex domain W. It has been shown that for smooth objective functions, the convergence rate of standard GD is O(1/T ) [121], and can be improved to O(1/T 2 ) by an accelerated GD algorithm [120, 121, 123]. The main shortcoming of GD method is its high cost in computing the full gradient ∇F(wt ) when the number of training examples is large, i.e., it requires O(n) gradient computations per iteration. Stochastic gradient descent (SGD) alleviates this limitation of GD by sampling one (or a small set of) examples and computing a stochastic (sub)gradient at each iteration based on the sampled examples [29, 115, 134]. Since the computational cost of SGD per iteration is independent of the size of the data (i.e., n), it is usually appealing for large-scale learning and optimization. While SGD enjoys a high computational efficiency per iteration, it suffers from a slow convergence rate for optimizing smooth functions. 
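To make the contrast in per-iteration cost concrete, here is a minimal sketch of the two update rules for the finite-sum objective (7.1), written for a least-squares loss and a Euclidean-ball domain purely for illustration; the helper names are ours.

import numpy as np

def project_ball(w, R=10.0):
    nrm = np.linalg.norm(w)
    return w if nrm <= R else (R / nrm) * w

def full_gradient(w, X, y):
    # gradient of F(w) = (1/n) sum_i (y_i - <w, x_i>)^2, an O(n d) computation
    return -2.0 * X.T @ (y - X @ w) / len(y)

def gd_step(w, X, y, eta):
    return project_ball(w - eta * full_gradient(w, X, y))

def sgd_step(w, X, y, eta, rng):
    i = rng.integers(len(y))                       # one sampled example: O(d) per step
    g = -2.0 * (y[i] - X[i] @ w) * X[i]
    return project_ball(w - eta * g)

Each call to gd_step touches all n examples, while sgd_step touches a single one, which is precisely the trade-off between the two methods discussed above.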
It has been shown in [119] that the √ effect of the stochastic noise cannot be decreased with a better rate than O(1/ T ) which 198 is significantly worse than GD that uses the full gradients for updating the solutions and this limitation is also valid when the target function is smooth. In addition for general Lipschitz continuous convex functions, SGD exhibits the same convergence rate as that for the smooth functions, implying that smoothness of the loss function is essentially not very useful and can not be exploited in stochastic optimization. The slow convergence rate for stochastically optimizing smooth loss functions is mostly due to the variance in stochastic gradients: unlike the full gradient case where the norm of a gradient approaches to zero when the solution is approaching to the optimal solution, in stochastic optimization, the norm of a stochastic gradient is constant even when the solution is close to the optimal solution. It is √ the variance in stochastic gradients that makes the convergence rate O(1/ T ) unimprovable for stochastic smooth optimization [119, 4]. In this chapter, we are interested in designing an efficient algorithm that is in the same spirit of SGD but can effectively leverage the smoothness of the loss function to achieve a significantly faster convergence rate. To this end, we consider a new setup for optimization that allows us to interplay between stochastic and deterministic gradient descent methods. In particular, we assume that the optimization algorithm has an access to two oracles: • A stochastic oracle Os that returns the loss function fi (w) based on the sampled training example (xi , yi ) 1 , and • A full gradient oracle Of that returns the gradient ∇F(w) for any given solution w ∈ W. We refer to this new setting as mixed optimization in order to distinguish it from both stochastic and full gradient optimization models. Obviously, the challenging issue in this 1 We note that the stochastic oracle assumed in our study is slightly stronger than the stochastic gradient oracle as it returns the sampled function instead of the stochastic gradient. 199 regard is to minimize the number of full gradients to be as minimum as possible while having the same number of stochastic gradient accesses. The key question we examined in this chapter is: ”Is it possible to improve the convergence rate for stochastic optimization of smooth functions by having a small number of calls to the full gradient oracle Of ? ” In this chapter we give an affirmative answer to this question. In particular, we show that with an additional O(ln T ) accesses to the full gradient oracle Of , the proposed algorithm, referred to as MixedGrad, can improve the convergence rate for stochastic optimization of smooth functions to O(1/T ), the same rate for stochastically optimizing a strongly convex function [72, 125, 137]. The MixedGrad algorithm builds off on multi-stage methods [72] and operates in epochs, but involves novel ingredients so as to obtain an O(1/T ) rate for smooth losses. In particular, we form a sequence of strongly convex objective functions to be optimized at each epoch and decrease the amount of regularization and shrink the domain as the algorithm proceeds. The full gradient oracle Of is only called at the beginning of each epoch. 
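The following is a minimal, unconstrained Euclidean sketch of a single epoch of the scheme just outlined: one call to the full gradient oracle O_f at the anchor point, followed by stochastic steps whose search direction combines that full gradient with the sampled loss returned by O_s. The precise method, including the shrinking domain and the parameter schedules, is given as Algorithm 10 in the next section; the helper names here are ours.

import numpy as np

def mixed_epoch(w_bar, full_grad, grad_fi, n, lam, eta, T, rng):
    # the single access to the full gradient oracle O_f in this epoch
    g = lam * w_bar + full_grad(w_bar)
    d = np.zeros_like(w_bar)                 # offset from the anchor point w_bar
    d_avg = np.zeros_like(w_bar)
    for _ in range(T):
        i = rng.integers(n)                  # O_s returns the sampled loss f_i
        g_hat = g + grad_fi(i, w_bar + d) - grad_fi(i, w_bar)   # combined search direction
        d = d - eta * (g_hat + lam * d)      # step on the l2-regularized epoch objective
        d_avg += d
    return w_bar + d_avg / T                 # fold the averaged offset back into the anchor

Because each f_i is smooth, the stochastic part of this direction, ∇f_i(w̄ + d) − ∇f_i(w̄), has norm at most β∥d∥ (with β the smoothness constant) and therefore shrinks together with the offset, which is the variance-reduction mechanism the next section makes precise.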
In this chapter our focus is only smooth functions and in the coming chapters, we show that the it is further possible to develop faster optimization schemes when the functions is both smooth and strongly convex by making the number of full gradients independent of condition number. 7.2 The MixedGrad Algorithm In stochastic first-order optimization setting, instead of having direct access to F(w), we only have access to a stochastic gradient oracle, which given a solution w ∈ W, returns the gradient ∇fi (w) where i is sampled uniformly at random from {1, 2, · · · , n}. The goal 200 of stochastic optimization to use a bounded number T of oracle calls, and compute some ¯ ∈ W such that the optimization error, F(w) ¯ − F (w∗ ), is as small as possible. w In the mixed optimization model considered in this study, we first relax the stochastic oracle Os by assuming that it will return a randomly sampled loss function fi (w), instead of the gradient ∇fi (w) for a given solution w may be applied to achieve O(1/T ) convergence. Second, we assume that the learner also has an access to the full gradient oracle Of . Our goal is to significantly improve the convergence rate of stochastic gradient descent (SGD) by making a small number of calls to the full gradient oracle Of . In particular, we show that by having only O(log T ) accesses to the full gradient oracle and O(T ) accesses to the stochastic oracle, we can tolerate the noise in stochastic gradients and attain an O(1/T ) convergence rate for optimizing smooth functions. We now turn to describe the proposed mixed optimization algorithm and state its convergence rate. The detailed steps of MixedGrad algorithm are shown in Algorithm 10. It follows the epoch gradient descent algorithm proposed in [72] for stochastically minimizing strongly convex functions and divides the optimization process into m epochs, but involves novel ingredients so as to obtain an O(1/T ) convergence rate. The key idea is to introduce a ℓ2 regularizer into the objective function to make it strongly convex, and gradually reduce the amount of regularization over the epochs. We also shrink the domain as the algorithm proceeds. We note that reducing the amount of regularization over time is closely-related to the classic proximal-point algorithms. Throughout the chapter, we will use the subscript for the index of each epoch, and the superscript for the index of iterations within each epoch. Below, we describe the key idea behind the MixedGrad algorithm. ¯ k be the solution obtained before the kth epoch, which is initialized to be 0 for Let w ¯ k, the first epoch. Instead of searching for w∗ at the kth epoch, our goal is to find w∗ − w 201 Algorithm 10 MixedGrad Algorithm 1: Input: • step size η1 • domain size ∆1 • the number of iterations T1 for the first epoch • the number of epochs m • regularization parameter λ1 • shrinking parameter γ > 1 ¯1 = 0 2: Initialize: w 3: for k = 1, . . . , m do 4: Construct the domain Wk = {w : w + wk ∈ W, ∥w∥ ≤ ∆k } ¯ k) 5: Call the full gradient oracle Of for ∇F(w ∑ ¯ k + ∇F(w ¯ k ) = λk w ¯ k + n1 n ¯ k) 6: Compute gk = λk w i=1 ∇fi (w 1 7: Initialize wk = 0 8: for t = 1, . . . 
, Tk do 9: Call stochastic oracle Os to return a randomly selected loss function fit (w) k ˆkt = gk + ∇fit (wkt + w ¯ k ) − ∇fit (w ¯ k) 10: Compute the stochastic gradient as g k k 11: Update the solution by 1 ˆkt + λk wkt ⟩ + ∥w − wkt ∥2 wkt+1 = arg max ηk ⟨w − wkt , g 2 w∈W k end for 1 ∑T +1 wt and w ¯ k+1 = w ¯ k + wk+1 Set wk+1 = T +1 t=1 k 14: Set ∆k+1 = ∆k /γ, λk+1 = λk /γ, ηk+1 = ηk /γ, and Tk+1 = γ 2 Tk 15: end for ¯ m+1 Return w 12: 13: resulting in the following optimization problem for the kth epoch λk 1∑ ¯ k ∥2 + ¯ k ), fi (w + w ∥w + w 2 n n min w + wk ∈ W (7.2) i=1 ∥w∥ ≤ ∆k where ∆k specifies the domain size of w and λk is the regularization parameter introduced at the kth epoch. By introducing the ℓ2 regularizer, the objective function in (7.2) becomes strongly convex, making it possible to exploit the technique for stochastic optimization of 202 strongly convex function in order to improve the convergence rate. The domain size ∆k and the regularization parameter λk are initialized to be ∆1 > 0 and λ1 > 0, respectively, and are reduced by a constant factor γ > 1 every epoch, i.e., ∆k = ∆1 /γ k−1 and λk = λ1 /γ k−1 . ¯ k ∥2 /2 from the objective function in (7.2), we obtain By removing the constant term λk ∥w the following optimization problem for the kth epoch [ min w∈Wk ] n ∑ λk 1 ¯ k⟩ + ¯ k) , Fk (w) = ∥w∥2 + λk ⟨w, w fi (w + w 2 n (7.3) i=1 where Wk = {w : w + wk ∈ W, ∥w∥ ≤ ∆k }. We rewrite the objective function Fk (w) as n 1∑ λk 2 ¯ k⟩ + ¯ k) ∥w∥ + λk ⟨w, w fi (w + w Fk (w) = 2 n i=1 ⟨ ⟩ n n ∑ λk 1 1∑ 2 ¯k + ¯ k) + ¯ k ) − ⟨w, ∇fi (w ¯ k )⟩ = ∥w∥ + w, λk w ∇fi (w fi (w + w 2 n n i=1 i=1 n 1∑ k λk 2 ∥w∥ + ⟨w, gk ⟩ + fi (w) = 2 n (7.4) i=1 where 1∑ ¯k + ¯ k ) and fik (w) = fi (w + w ¯ k ) − ⟨w, ∇fi (w ¯ k )⟩. g k = λk w ∇fi (w n n i=1 The main reason for using fik (w) instead of fi (w) is to tolerate the variance in the stochastic gradients. To see this, from the smoothness assumption of fi (w) we obtain the following inequality for the norm of fik (w) as: ¯ k ) − ∇fi (w ¯ k )∥ ≤ β∥w∥. ∇fik (w) = ∥∇fi (w + w 203 As a result, since ∥w∥ ≤ ∆k and ∆k shrinks over epochs, then ∥w∥ will approach to zero over epochs and consequentially ∥∇fik (w)∥ approaches to zero, which allows us to effectively control the variance in stochastic gradients, a key to improving the convergence of stochastic optimization for smooth functions to O(1/T ). Using Fk (w) in (7.4), at the tth iteration of the kth epoch, we call the stochastic oracle Os to randomly select a loss function f k (w) and update the solution by following the standard it paradigm of SGD by wkt+1 ( ) t t k t = Πw∈W wk − ηk (λk wk + gk + ∇f t (wk )) k i k ( ) t t t ¯ k ) − ∇fit (w ¯ k )) , = Πw∈W wk − ηk (λk wk + gk + ∇fit (wk + w k k (7.5) k where Πw∈W (·) projects the solution w into the domain Wk that shrinks over epochs. k At the end of each epoch, we compute the average solution wk , and update the solution ¯ k to w ¯ k+1 = w ¯ k + wk . Similar to the epoch gradient descent algorithm [72], we from w increase the number of iterations by a constant γ 2 for every epoch, i.e. Tk = T1 γ 2(k−1) . In order to perform stochastic gradient updating given in (7.5), we need to compute vector gk at the beginning of the kth epoch, which requires an access to the full gradient oracle Of . It is easy to count that the number of accesses to the full gradient oracle Of is m, and the number of accesses to the stochastic oracle Os is T = T1 m ∑ γ 2(i−1) = i=1 γ 2m − 1 T1 . 
γ2 − 1 Thus, if the total number of accesses to the stochastic gradient oracle is T , the number of 204 access to the full gradient oracle required by MixedGrad algorithm is O(ln T ), consistent with our goal of making a small number of calls to the full gradient oracle. The theorem below shows that for smooth objective functions, by having O(ln T ) access to the full gradient oracle Of and O(T ) access to the stochastic oracle Os , by running MixedGrad algorithm, we achieve an optimization error of O(1/T ). Theorem 7.1. Let δ ≤ e−9/2 be the failure probability. Set γ = 2, λ1 = 16β and T1 = 300 ln m , δ η1 = 1 √ , and ∆1 = R. 2β 3T1 ( ) ¯ m+1 be the solution returned by MixedGrad method in Define T = T1 22m − 1 /3. Let w Algorithm 10 after m epochs with m = O(ln T ) calls to the full gradient oracle Of and T calls to the stochastic oracle Os . Then, with a probability 1 − 2δ, we have 80βR2 ¯ m+1 ) − min F(w) ≤ 2m−2 = O F(w 2 w∈W 7.3 ( ) β . T Analysis of Convergence Rate Now we turn to proving the main theorem. The proof will be given in a series of lemmas and theorems where the proof of few are given in Section 7.4. The proof of main theorem is based on induction. To this end, let w∗k be the optimal solution that minimizes Fk (w) defined in (7.3). The key to our analysis is show that when ∥w∗k ∥ ≤ ∆k , with a high probability, it holds that ∥w∗k+1 ∥ ≤ ∆k /γ, where w∗k+1 is the optimal solution that minimizes Fk+1 (w), as revealed by the following theorem. Theorem 7.2. Let w∗k and w∗k+1 be the optimal solutions that minimize Fk (w) and Fk+1 (w), 205 respectively, and wk+1 be the average solution obtained at the end of kth epoch of MixedGrad ( √ ) algorithm. Suppose ∥w∗k ∥ ≤ ∆k . By setting the step size ηk = 1/ 2β 3Tk , we have, with a probability 1 − 2δ, λk ∆2k ∆k k+1 ∥w∗ ∥ ≤ and Fk (wk+1 ) − min Fk (w) ≤ 4 w γ 2γ provided that δ ≤ e−9/2 and Tk ≥ 300γ 8 β 2 1 ln . δ λ2k Taking this statement as given for the moment, we proceed with the proof of Theorem 7.1, returning later to establish the claim stated in Theorem 7.2. Proof of Theorem 7.1. It is easy to check that for the first epoch, using the fact W ∈ BR , we have ∥w∗1 ∥ = ∥w∗ ∥ ≤ R := ∆1 . Let w∗m be the optimal solution that minimizes Fm (w) and let w∗m+1 be the optimal solution obtained in the last epoch. Using Theorem 7.1, with a probability 1 − 2mδ, we have ∆1 ∥w∗m ∥ ≤ m−1 , γ Fm (wm+1 ) − Fm (w∗m ) ≤ λ1 ∆21 λm ∆2m = 2γ 4 2γ 3m+1 Hence, λ1 ∆21 λ1 1∑ ¯ m⟩ ¯ m+1 ) ≤ Fm (w∗m ) + 3m+1 − m−1 ⟨wm+1 , w fi (w n 2γ γ n i=1 ≤ Fm (w∗m ) + 206 λ1 ∆21 ¯ m ∥∆ λ ∥w + 1 2m−2 1 3m+1 2γ γ where the last step uses the fact ∥w∗m+1 ∥ ≤ ∆m = ∆1 γ 1−m . Since ¯ m∥ ≤ ∥w m ∑ |wi | ≤ i=1 m ∑ ∆i ≤ i=1 γ∆1 ≤ 2∆1 γ−1 where in the last step holds under the condition γ ≥ 2. By combining above inequalities, we obtain λ1 ∆21 2λ1 ∆2 1∑ ¯ m+1 ) ≤ Fm (w∗m ) + 3m+1 + 2m−21 . fi ( w n 2γ γ n i=1 Our final goal is to relate Fm (w) to minw L(w). Since w∗m minimizes Fm (w), for any w∗ ∈ arg min L(w), we have n ) λ1 ( 1∑ 2 m ¯ m ∥ + 2⟨w∗ − w ¯ m, w ¯ m ⟩ . (7.6) fi (w∗ ) + m−1 ∥w∗ − w Fm (w∗ ) ≤ Fm (w∗ ) = n 2γ i=1 ¯ m ∥. To this end, after the first Thus, the key to bound |F(w∗m )−F (w∗ )| is to bound ∥w∗ − w ¯ m+1 , w ¯ m+2 , . . . be the sequence of m epoches, we run Algorithm 10 with f ull gradients. Let w solutions generated by Algorithm 10 after the first m epochs. For this sequence of solutions, Theorem 7.2 will hold deterministically as we deploy the full gradient for updating, i.e., ∥wk ∥ ≤ ∆k for any k ≥ m + 1. 
Since we reduce λk exponentially, λk will approach to zero ¯ k }∞ and therefore the sequence {w k=m+1 will converge to w∗ , one of the optimal solutions ¯ k }∞ ¯ k ∥ ≤ ∆k for any that minimize L(w). Since w∗ is the limit of sequence {w k=m+1 and ∥w k ≥ m + 1, we have ¯ m∥ ≤ ∥w∗ − w ∞ ∑ i=m+1 |wi | ≤ ∞ ∑ k=m+1 207 ∆1 2∆ ∆k ≤ m ≤ m1 −1 γ γ (1 − γ ) where the last step follows from the condition γ ≥ 2. Thus, λ1 1∑ Fm (w∗m ) ≤ fi (w∗ ) + m−1 n 2γ n ( i=1 4∆21 8∆21 + m γ γ 2m ) n n ) 1∑ 2λ1 ∆21 ( 5λ1 ∆2 1∑ −m = fi (w∗ ) + 2m−1 2 + γ ≤ fi (w∗ ) + 2m−11 n n γ γ i=1 (7.7) i=1 By combining the bounds in (7.6) and (7.7), we have, with a probability 1 − 2mδ, 5λ1 ∆2 1∑ 1∑ ¯ m+1 ) − fi (w fi (w∗ ) ≤ 2m−21 = O(1/T ) n n γ n n i=1 i=1 where T = T1 m−1 ∑ k=0 ( ) T1 γ 2m − 1 2k γ = γ2 − 1 T ≤ 1 γ 2m . 3 We complete the proof by plugging in the stated values for γ, λ1 and ∆1 . We turn now to proving the Theorem 7.2. For the convenience of discussion, we drop the subscript k for epoch just to simplify our notation. Let λ = λk , T = Tk , ∆ = ∆k , g = gk . ¯ =w ¯ k be the solution obtained before the start of the epoch k, and let w ¯′ = w ¯ k+1 be Let w the solution obtained after running through the kth epoch. We denote by F(w) and F ′ (w) the objective functions Fk (w) and Fk+1 (w). They are given by λ 1∑ ¯ + ¯ F(w) = ∥w∥2 + λ⟨w, w⟩ fi (w + w) 2 n n (7.8) i=1 F ′ (w) = λ 1∑ λ ¯ ′⟩ + ¯ ′) ∥w∥2 + ⟨w, w fi (w + w 2γ γ n n (7.9) i=1 Let w∗ = w∗k and w∗′ = w∗k+1 be the optimal solutions that minimize F(w) and F ′ (w) over the domain Wk and Wk+1 , respectively. Under the assumption that ∥w∗ ∥ ≤ ∆, our goal is 208 to show ∥w∗′ ∥ ≤ ∆ , γ ¯ ′ ) − F(w∗ ) ≤ F(w λ∆2 2γ 4 The following lemma bounds F(wt ) − F(w∗ ) where the proof is deferred to Section 7.4. Lemma 7.3. ∥wt − w∗ ∥2 ∥wt+1 − w∗ ∥2 − 2η 2η 2 η + ∇fit (wt ) + λwt + ⟨g, wt − wt+1 ⟩ 2 ⟨ ⟩ + ∇F(w∗ ) − ∇fit (w∗ ), wt − w∗ ⟨ ⟩ + −∇fit (wt ) + ∇fit (w∗ ) − ∇F(w∗ ) + ∇F(wt ), wt − w∗ F(wt ) − F (w∗ ) ≤ ¯ 1 = 0, we By adding the inequality in Lemma 7.7 over all iterations, using the fact w have T ∑ F(wt ) − F(w∗ ) ≤ t=1 + ∥w∗ ∥2 ∥wT +1 − w∗ ∥2 − − ⟨g, wT +1 ⟩ 2η 2η ∑ η∑ ⟨∇F(w∗ ) − ∇fit (w∗ ), wt − w∗ ⟩ ∥∇fit (wt ) + λwt ∥2 + 2 T T t=1 t=1 ≜AT + T ⟨ ∑ ≜BT ⟩ −∇fit (wt ) + ∇fit (w∗ ) − ∇F(w∗ ) + ∇F(wt ), wt − w∗ . t=1 ≜CT Since g = ∇F (0) and F(wT +1 ) − F(0) ≤ ⟨∇F (0), wT +1 ⟩ + β β ∥wT +1 ∥2 = ⟨g, wT +1 ⟩ + ∥wT +1 ∥2 2 2 209 using the fact F(0) ≤ F (w∗ ) + β2 ∥w∗ ∥2 and max(∥w∗ ∥, ∥wT +1 ∥) ≤ ∆, we have −⟨g, wT +1 ⟩ ≤ F(0) − F(wT +1 ) + β 2 ∆ ≤ β∆2 − (F(wT +1 ) − F(w∗ )) 2 and therefore T∑ +1 ( F(wt ) − F(w∗ ) ≤ ∆2 t=1 1 +β 2η ) η + AT + BT + CT . 2 (7.10) The following lemmas bound AT , BT and CT . Lemma 7.4. For AT defined above we have AT ≤ 6β 2 ∆2 T . The following lemma upper bounds BT and CT . The proof is based on the Bernstein’s inequality for martingales and is given later . Lemma 7.5. With a probability 1 − 2δ, we have ( ) ( ) √ √ 1 1 1 1 BT ≤ β∆2 ln + 2T ln and CT ≤ 2β∆2 ln + 2T ln . δ δ δ δ Using Lemmas 7.8 and 7.9, by substituting the uppers bounds for AT , BT , and CT in (7.10), with a probability 1 − 2δ, we obtain ( ) √ T∑ +1 1 1 1 F(wt ) − F(w∗ ) ≤ ∆2 + β + 6β 2 ηT + 3β ln + 3β 2T ln 2η δ δ t=1 √ By choosing η = 1/[2β 3T ], we have ( ) √ T∑ +1 √ 1 1 F(wt ) − F(w∗ ) ≤ ∆2 2β 3T + β + 3β ln + 3β 2T ln δ δ t=1 and using the fact w = ∑T +1 i=1 wt /(T + 1), we have √ √ 5β 3 ln[1/δ] 5β 3 ln[1/δ] √ F(w) − F (w∗ ) ≤ ∆2 √ , and ∆2 = ∥w − w∗ ∥2 ≤ ∆2 . 
T +1 λ T +1 210 Thus, when T ≥ [300γ 8 β 2 ln 1δ ]/λ2 , we have, with a probability 1 − 2δ, ∆2 ≤ ∆2 λ , and |F(w) − F(w∗ )| ≤ 4 ∆2 . 4 γ 2γ (7.11) The next lemma relates ∥w∗′ ∥ to ∥w − w∗ ∥. Lemma 7.6. We have ∥w∗′ ∥ ≤ γ∥w − w∗ ∥. Combining the bound in (7.11) with Lemma 7.10, we have ∥w∗′ ∥ ≤ ∆/γ. 7.4 7.4.1 Proofs of Convergence Rate Proof of Lemma 7.7 Before proving the lemmas we recall the definition of F(w), F ′ (w), g, and fi (w) as: λ 1∑ ¯ + ¯ ∥w∥2 + λ⟨w, w⟩ fi (w + w), 2 n n F(w) = i=1 F ′ (w) = λ 1∑ λ ¯ ′⟩ + ¯ ′ ), ∥w∥2 + ⟨w, w fi (w + w 2γ γ n n i=1 ¯+ g = λw 1 n n ∑ ¯ ∇fi (w), i=1 ¯ − ⟨w, ∇fi (w)⟩. ¯ fi (w) = fi (w + w) We also recall that w∗ and w∗′ are the optimal solutions that minimize F(w) and F ′ (w) over the domain Wk and Wk+1 , respectively. 211 Lemma 7.7. ∥wt+1 −w∗ ∥2 ∥wt −w∗ ∥2 − + η2 2η 2η F(wt ) − F(w∗ ) ≤ + ⟨ + ∇fit (wt ) + λwt 2 + ⟨g, wt − wt+1 ⟩ ⟩ ⟨ ∇fit (w∗ ) − ∇F(w∗ ), wt − w∗ −∇fit (wt ) + ∇fit (w∗ ) − ∇F(w∗ ) + ∇F(wt ), wt − w∗ ⟩ Proof. For each iteration t in the kth epoch, from the strong convexity of F(w) we have λ F(wt ) − F(w∗ ) ≤ ⟨∇F(wt ), wt − w∗ ⟩ − ∥wt − w∗ ∥2 2 ⟨ ⟩ λ = ⟨g + ∇fit (wt ) + λwt , wt − w∗ ⟩ + −∇fit (wt ) + ∇F(wt ), wt − w∗ − ∥wt − w∗ ∥2 , 2 where F(w) = n1 ∑n i=1 fi (w). We now try to upper bound the first term in the right hand side. Since ⟨g + ∇fit (wt ) + λwt , wt − w∗ ⟩ ∥wt − w∗ ∥2 ∥wt − w∗ ∥2 + 2η 2η 2 ∥wt+1 − w∗ ∥2 ∥wt − w∗ ∥2 ∥wt − wt+1 ∥ − + ⟨g + ∇fit (wt ) + λwt , wt − wt+1 ⟩ − 2η 2η 2η 2 2 ∥wt+1 − w∗ ∥ ∥wt − w∗ ∥ ⟨g, wt − wt+1 ⟩ − + 2η 2η [ ] ∥wt − w∥2 max ⟨∇fit (wt ) + λwt , wt − w⟩ − w 2η 2 ∥wt − w∗ ∥2 η ∥wt+1 − w∗ ∥ + + ∥∇fit (wt ) + λwt ∥2 ⟨g, wt − wt+1 ⟩ − 2η 2η 2 = ⟨g + ∇fit (wt ) + λwt , wt − w∗ ⟩ − ≤ ≤ + = where the first inequality follows from the fact that wt+1 in the minimizer of the following 212 optimization problem: wt+1 = arg min ¯ w∈W∩∥w−w∥≤∆ ⟨g + ∇fit (wt ) + λwt , w − wt ⟩ + ∥w − wt ∥2 . 2η Therefore, we obtain F(wt ) − F (w∗ ) ≤ ∥wt − w∗ ∥2 ∥wt+1 − w∗ ∥2 λ − − ∥wt − w∗ ∥2 2η 2η 2 ⟩ 2 ⟨ η ∇fit (wt ) + λwt + ∇F(w∗ ) − ∇fit (w∗ ), wt − w∗ +⟨g, wt − wt+1 ⟩ + 2 ⟨ ⟩ + −∇fit (wt ) + ∇fit (w∗ ) − ∇F(w∗ ) + ∇F(wt ), wt − w∗ , as desired. We now turn to prove the upper bound on AT . Lemma 7.8. AT ≤ 6β 2 ∆2 T Proof. We bound AT as AT = ≤ ≤ T ∑ t=1 T ∑ t=1 T ∑ ∥∇fit (wt ) + λwt ∥2 2∥∇fit (wt )∥2 + 2λ2 ∥wt ∥2 2λ2 ∆2 + 2∥∇fit (wt ) − ∇fit (w∗ ) + ∇fit (w∗ )∥2 t=1 ≤ 6β 2 ∆2 T 213 where the second inequality follows (a + b)2 ≤ 2(a2 + b2 ) and the last inequality follows from the smoothness assumption. Lemma 7.9. With a probability 1 − 2δ, we have ( BT ≤ β∆2 1 ln + δ √ 1 2T ln δ ) ( and CT ≤ 2β∆2 1 ln + δ √ 1 2T ln δ ) The proof is based on the Berstein inequality for martingales stated in Theorem A.25. Equipped with this concentration inequality, we are now in a position to upper bound BT and CT as follows. Proof of Lemma 7.9. Denote Xt = ⟨∇fit (w∗ ) − ∇F(w∗ ), wt − w∗ ⟩. We have that the conditional expectation of Xt , given randomness in previous rounds, is Et−1 [Xt ] = 0. We now apply Theorem A.25 to the sum of martingale differences. In particular, we have, with a probability 1 − e−t , √ √ 2 BT ≤ Kt + 2Σt 3 where K = Σ = max ⟨∇fit (w∗ ) − ∇F(w∗ ), wt − w∗ ⟩ ≤ 2β∆2 1≤t≤T T ∑ [ ] Et |⟨∇fit (w∗ ) − ∇F(w∗ ), wt − w∗ ⟩|2 ≤ β 2 ∆4 T t=1 Hence, with a probability 1 − δ, we have ( 1 BT ≤ β∆2 ln + δ 214 √ 1 2T ln δ ) Similar, for CT , we have, with a probability 1 − δ, ( CT ≤ 2β∆2 1 ln + δ √ 1 2T ln δ ) Lemma 7.10. ∥w∗′ ∥ ≤ γ∥w − w∗ ∥. Proof. 
We rewrite F(w) as λ 1∑ ¯ + ¯ F(w) = ∥w∥2 + λ⟨w, w⟩ fi (w + w) 2 n n i=1 λ 1∑ ¯ + ¯ ′) = ∥w − w + w∥2 + λ⟨w − w + w, w⟩ fi (w − w + w 2 n n i=1 Define z = w − w. We have 1∑ λ ¯ + λ⟨w, w⟩ ¯ + ¯ ′) ∥z + w∥2 + λ⟨z, w⟩ fi (z + w F(w) = 2 n n i=1 n λ 1∑ λ ¯ ′⟩ + ¯ ′ ) + ∥w∥2 + λ⟨w, w⟩ ¯ = ∥z∥2 + λ⟨z, w fi (z + w 2 n 2 i=1 λ ¯ = F(z) + ∥w∥2 + λ⟨w, w⟩ 2 where λ 1∑ ¯ ′⟩ + ¯ ′) F(z) = ∥z∥2 + λ⟨z, w fi (z + w 2 n n i=1 Define w∗ = w∗ − w. Evidently, w∗ minimizes F(w). The only difference between F(w) and F ′ (w) is that they use different modulus of strong convexity λ. Thus, following [153], 215 we have ∥w∗ − w∗′ ∥ ≤ 1 − γ −1 ∥w∗ ∥ ≤ (γ − 1)∥w∗ ∥ γ −1 Hence, ∥w∗′ ∥ ≤ γ∥w∗ ∥ = γ∥w∗ − w∥ which completes the proofs. 7.5 Summary We presented a new paradigm for optimization, termed as mixed optimization, that aims to improve the convergence rate of stochastic optimization by making a small number of calls to the full gradient oracle. We proposed the MixedGrad algorithm and showed that it is able to achieve an O(1/T ) convergence rate by accessing stochastic and full gradient oracles for O(T ) and O(log T ) times, respectively. We showed that the MixedGrad algorithm is able to exploit the smoothness of the function, which is believed to be not very useful in stochastic optimization. The key insight behind the MixedGrad algorithm is to use infrequent full gradients to progressively reduce the variance of stochastic gradients as the optimization proceeds. There are few directions that are worthy of investigation. First, it would be interesting to examine the optimality of our algorithm, namely if it is possible to achieve a better convergence rate for stochastic optimization of smooth functions using O(ln T ) accesses to the full gradient oracle. Furthermore, to alleviate the computational cost caused by O(log T ) accesses to the full gradient oracle, it would be interesting to empirically evaluate the proposed algorithm in a distributed framework by distributing the individual functions among 216 processors to parallelize the full gradient computation at the beginning of each epoch which requires O(log T ) communications between the processors in total. Lastly, it is very interesting to check whether an O(1/T 2 ) rate could be achieved by an accelerated method in the mixed optimization scenario, and whether linear convergence rates could be achieved in the strongly-convex case. 7.6 Bibliographic Notes Deterministic Smooth Optimization. The convergence rate of gradient based methods usually depends on the analytical properties of the objective function to be optimized. When the objective function is strongly convex and smooth, it is well known that a simple GD method can achieve a linear convergence rate [33]. For a non-smooth Lipschitz-continuous √ function, the optimal rate for the first order method is only O(1/ T ) [121]. Although √ O(1/ T ) rate is not improvable in general, several recent studies are able to improve this rate to O(1/T ) by exploiting the special structure of the objective function [123, 122]. In the full gradient based convex optimization, smoothness is a highly desirable property. It has been shown that a simple GD achieves a convergence rate of O(1/T ) when the objective function is smooth, which is further can be improved to O(1/T 2 ) by using the accelerated gradient methods [120, 123, 121]. Stochastic Smooth Optimization. Unlike the optimization methods based on full gradients, the smoothness assumption was not exploited by most stochastic optimization methods. 
√ In fact, it was shown in [119] that the O(1/ T ) convergence rate for stochastic optimization cannot be improved even when the objective function is smooth. This classical result is further confirmed by the recent studies of composite bounds for the first order optimization methods [17, 96]. The smoothness of the objective function is exploited extensively in mini- 217 batch stochastic optimization [44, 47], where the goal is not to improve the convergence rate but to reduce the variance in stochastic gradients and consequentially the number of times for updating the solutions [154]. We finally note that the smoothness assumption coupled with the strong convexity of function is beneficial in stochastic setting and yields a geometric convergence in expectation using Stochastic Average Gradient (SAG) and Stochastic Dual Coordinate Ascent (SDCA) algorithms proposed in [128] and [135], respectively. Finally, we would like to distinguish mixed optimization from hybrid methods that use growing sample-sizes as optimization method proceeds to gradually transform the iterates into the full gradient method [60], which makes the iterations to be dependent to the sample size n as opposed to SGD. In contrast, MixedGrad is as an alternation of deterministic and stochastic gradient steps, with different of frequencies for each type of steps. Our result for mixed optimization is useful for the scenario when the full gradient of the objective function can be computed relatively efficient although it is still significantly more expensive than computing a stochastic gradient. An example of such a scenario is distributed computing where the computation of full gradients can be speeded up by having it run in parallel on many machines with each machine containing a relatively small subset of the entire training data. Of course, the latency due to the communication between machines will result in an additional cost for computing the full gradient in a distributed fashion. 218 Chapter 8 Mixed Optimization for Smooth and Strongly Convex Losses In the preceding chapter, we presented a new paradigm for stochastic optimization that allowed us to leverage the smoothness of objective function to devise faster algorithms. In this chapter of thesis, we continue our study of efficient optimization algorithms in mixed optimization regime and show that we may leverage the smoothness assumption of loss functions to devise algorithms with iteration complexities that are independent of condition number in accessing the full gradient oracle. To motivate the setting considered in this chapter, consider the optimization of smooth and strongly convex functions where the optimal itera√ tion complexity of the gradient-based algorithm is O( κ log 1/ϵ), where κ is the condition number of the objective function to be optimized. In the case that the optimization problem is ill-conditioned, we need to evaluate a larger number of full gradients, which could be computationally expensive despite the linear convergence rate of the algorithm in terms of the target accuracy ϵ. In this chapter, we propose to reduce the number of full gradients required by allowing the algorithm to access the stochastic gradients of the objective function. To this end, we present an algorithm named Epoch Mixed Gradient Descent (EMGD) that is able to utilize two kinds of gradients similar to Chapter 7. 
As same as MixedGrad a distinctive step in 219 EMGD is the mixed gradient descent, where we use a combination of the full gradient and the stochastic gradient to update the intermediate solutions. By performing a fixed number of mixed gradient descents, we are able to improve the sub-optimality of the solution by a constant factor, and thus achieve a linear convergence rate. Theoretical analysis shows that EMGD is able to find an ϵ-optimal solution by computing O(log 1/ϵ) full gradients and O(κ2 log 1/ϵ) stochastic gradients. We also provide experimental evidence complementing our theoretical results for classification problem on few medium-sized data sets. 8.1 Introduction The optimal iteration complexities for some popular optimization methods considering different combinations of characteristics of the objective function are shown in Table 8.1. We observe that when the objective function is smooth (and strongly convex), the convergence rate for full gradient descent is much faster than that for stochastic gradient descent. On the other hand, the evaluation of a stochastic gradient is usually significantly more efficient than that for a full gradient. Thus, replacing full gradients with stochastic gradients essentially trades the number of iterations with a low computational cost per iteration. In this chapter, we consider the case when the objective function is both smooth and √ strongly convex, where the optimal iteration complexity is O( κ log 1ϵ ) if the optimization method is first order and has access to the full gradients. For the optimization problems that are ill-conditioned, the condition number κ can be very large, leading to many evaluations of full gradients, an operation that is computationally expensive for large data sets. To reduce the computational cost, we are interested in the possibility of making the number of √ full gradients required independent from κ. Although the O( κ log 1ϵ ) rate is in general not 220 Lipschitz continuous ( ) Full Gradient Stochastic Gradient Smooth ( ) O √Lϵ ( ) O 12 1 ( ϵ2 ) O 12 ϵ O ϵ Smooth & Strongly Convex O (√ ) κ log 1ϵ ( ) 1 O λϵ Table 8.1: The optimal iteration complexity of convex optimization. L and λ are the moduli of smoothness and strongly convexity, respectively. κ = L/λ is the conditional number. improvable for any first order method, we bypass this difficulty by allowing the algorithm to have access to both full and stochastic gradients. Our objective is to reduce the iteration √ complexity from O( κ log 1ϵ ) to O(log 1ϵ ) by replacing most of the evaluations of full gradients with the evaluations of stochastic gradients. Under the assumption that stochastic gradients can be computed efficiently, this tradeoff could lead to a significant improvement in computational efficiency. We propose an efficient algorithm, dubbed Epoch Mixed Gradient Descent (EMGD), which fits the mixed optimization regime introduced in Chapter 7. The proposed EMGD algorithm divides the optimization process into a sequence of different epochs, an idea that is borrowed from the epoch gradient descent [72]. In each epoch, the proposed algorithm performs mixed gradient descent by evaluating one full gradient and O(κ2 ) stochastic gradients. It achieves a constant reduction in the optimization error for every epoch, leading to a linear convergence rate. Our analysis shows that EMGD is able to find an ϵ-optimal solution by computing O(log 1ϵ ) full gradients and O(κ2 log 1ϵ ) stochastic gradients. 
In other words, with the help of stochastic gradients, the number of full gradients required is reduced √ from O( κ log 1ϵ ) to O(log 1ϵ ), independent from the condition number. 221 8.2 The Epoch Mixed Gradient Descent Algorithm First, we recall from Chapter 2 that we wish to solve the following optimization problem ∫ min F(w) for F(w) = E[f (w, ξ)] = w∈W f (w, ξ)dP (ξ), (8.1) Ξ where W is a convex domain, and f (w, ξ) is a convex function with respect to the first argument. An special setting which is more appropriate for machine learning tasks is the case when the objective function can be written as a sum of finite number of convex functions, i.e., 1∑ F(w) = f (w, ξi ). n n (8.2) i=1 For learning problems such as classification and regression, each individual function in the summand can be considered as the prediction loss on the ith training example ξi = (xi , yi ) for a fixed loss function, i.e., f (w, ξi ) = ℓ(w; (xi , yi )). For simplicity of exposition, we absorb the randomness in the individual functions and use fi (w) to denote the loss on random sample ξi instead of f (w, ξi ). We note that although the formulation in (8.2) seems attractive from a practical point of view, but the proposed algorithm is general enough to solve any stochastic optimization problem formulated in (8.1). As a result, in the remainder of this chapter we base the randomness on sampling the individual functions according to the unknown distribution defined over functions. Similar to the setting introduced in Chapter 7, we assume there exist two oracles. 1. The first one is a gradient oracle Og , which for a given input point w returns the 222 gradient ∇F (w), that is, Og (w) = ∇F(w). 2. The second one is a function oracle Of , each call of which returns a random function f (w), such that F(w) = Ef [f (w)], ∀w ∈ W, and f (w) has Lipschitz continuous gradients with constant L, that is, ∥∇f (w) − ∇f (w′ )∥ ≤ L∥w − w′ ∥, ∀w, w′ ∈ W. (8.3) Although we do not define a stochastic gradient oracle directly, the function oracle Of allows us to evaluate the stochastic gradient of F(w) at any point w ∈ W. Notice that the assumption about the function oracle Of implies that the objective function F(·) is also L-smooth. To see this, since ∇F(w) = Ef [f (w)], by Jensen’s inequality we have: ∥∇F (w) − ∇F(w′ )∥ ≤ Ef ∥∇f (w) − ∇f (w′ )∥ ≤ L∥w − w′ ∥, ∀w, w′ ∈ W. (8.4) Besides, we further assume F(·) is λ-strongly convex, that is, ∥∇F(w) − ∇F(w′ )∥ ≥ λ∥w − w′ ∥, ∀w, w′ ∈ W. (8.5) From (8.4) and (8.5) it is straightforward to see that L ≥ λ. The condition number κ is defined as the ratio between these two parameters, i.e., κ = L/λ ≥ 1. The detailed steps of the proposed Epoch Mixed Gradient Descent (EMGD) are shown 223 in Algorithm 11, where we use the superscript for the index of epochs, and the subscript for the index of iterations in each epoch. Similar to the MixedGrad algorithm, we divided the optimization process into a sequence of epochs (step 4 to step 11). While in the MixedGrad algorithm the size of epochs increases exponentially and the full gradient oracle is called at the beginning of each epoch, the size of epochs and the number of access to the two types of oracles in EMGD is fixed. At the beginning of each epoch, we initialize the solution w1k to be the average solution ¯ k obtained from the last epoch, and then call the gradient oracle Og to obtain ∇F (w ¯ k ). 
w At each iteration t of epoch k, we call the function oracle Of to obtain a random function ftk (w) and define the mixed gradient at the current solution wtk as ˜tk = ∇F (w ¯ k ) + ∇ftk (wtk ) − ∇ftk (w ¯ k ), g which involves both the full gradient and the stochastic gradient. The mixed gradient can be ¯ k ) and the stochastic part ∇ftk (wtk ) − divided into two parts: the deterministic part ∇F(w ¯ k ). Due to the smoothness property of ftk (·), the norm of the stochastic part is well ∇ftk (w bounded, which facilitates the convergence analysis. Based on the mixed gradient, we update wtk by a gradient mapping over a shrinking ¯ k ∥ ≤ ∆k ) in step 9. Since the updating is similar to the standard domain (i.e., W ∩ ∥w − w gradient descent except for the domain constraint, we refer to it as mixed gradient descent for short. At the end of the iterations for epoch k, we compute the average value of T + 1 √ solutions, instead of T solutions, and update the domain size by reducing a factor of 2. The following theorem shows the convergence rate of the proposed algorithm. 224 Algorithm 11 Epoch Mixed Gradient Descent (EMGD) Algorithm 1: Input: • step size η • the initial domain size ∆1 • the number of iterations T per epoch • the number of epochs m ¯1 = 0 2: Initialize: w 3: for k = 1, . . . , m do ¯k 4: Set w1k = w ¯ k) 5: Call the gradient oracle Og to obtain ∇F(w 6: for t = 1, . . . , T do 7: Call the function oracle Of to obtain a random function ftk (·) 8: Compute the mixed gradient as ¯ k) ¯ k ) + ∇ftk (wtk ) − ∇ftk (w ˜tk = ∇F(w g 9: Update the solution by k = arg wt+1 1 ˜tk ⟩ + ∥w − wtk ∥2 η⟨w − wtk , g 2 ¯ k ∥≤∆k w∈W∩∥w−w min 10: end for √ 1 ∑T +1 wk and ∆k+1 = ∆k / 2 ¯ k+1 = T +1 11: Set w t t=1 12: end for ¯ m+1 Return w Theorem 8.1. Assume 1152L2 1 δ ≤ e−1/2 , T ≥ ln , and ∆1 ≥ max δ λ2 (√ ) 2 (F(0) − F(w∗ )), ∥w∗ ∥ . λ (8.6) √ ¯ m+1 be the solution returned by Algorithm 11 after m epoches that Set η = 1/[L T ]. Let w has m access to oracle Og and mT access to oracle Of . Then, with a probability at least 1 − mδ, we have λ[∆1 ]2 [∆1 ]2 m+1 m+1 2 ¯ ¯ F(w ) − F(w∗ ) ≤ m+1 , and ∥w − w∗ ∥ ≤ m . 2 2 225 Theorem 8.1 immediately implies that EMGD is able to achieve an ϵ optimization error by computing O(log 1ϵ ) full gradients and O(κ2 log 1ϵ ) stochastic gradients. 8.3 Analysis of Convergence Rate The proof of Theorem 8.1 is based on induction. From the assumption about ∆1 in (8.6), we have ¯ 1 ) − F(w∗ ) ≤ F(w λ[∆1 ]2 ¯ 1 − w∗ ∥2 ≤ [∆1 ]2 , , and ∥w 2 which means Theorem 8.1 is true for m = 0. Suppose Theorem 8.1 is true for m = k. That is, with a probability at least 1 − kδ, we have ¯ k+1 ) − F(w∗ ) ≤ F(w 12 λ[∆1 ]2 k+1 − w ∥2 ≤ [∆ ] . ¯ , and ∥ w ∗ 2k+1 2k Our goal is to show that after running the k+1-th epoch, with a probability at least 1 − (k + 1)δ, we have ¯ k+2 ) − F(w∗ ) ≤ F(w 12 λ[∆1 ]2 k+2 − w ∥2 ≤ [∆ ] . ¯ , and ∥ w ∗ 2k+2 2k+1 ¯ be the solution For the simplicity of presentation, we drop the index k for epoch. Let w obtained from the epoch k. Given the condition ¯ − F(w∗ ) ≤ F(w) λ 2 ¯ − w∗ ∥2 ≤ ∆2 , ∆ , and ∥w 2 226 (8.7) we will show that that after running the T iterations in one epoch, the new solution, denoted by w, satisfies F(w) − F(w∗ ) ≤ λ 2 1 ¯ 2 ≤ ∆2 , ∆ , and ∥w − w∥ 4 2 (8.8) with a probability at least 1 − δ. In the proof, we frequently use the following property of strongly convex function (see Appendix ??). Lemma 8.2. Let F(w) be a λ-strongly convex function over the domain W, and w∗ = arg minw∈W F(w). Then, for any w ∈ W, we have F(w) − F(w∗ ) ≥ λ ∥w − w∗ ∥2 . 
2 (8.9) Define ¯ F(w) = F(w) − ⟨w, g⟩, and gt (w) = ft (w) − ⟨w, ∇ft (w)⟩. ¯ g = ∇F(w), (8.10) The objective function can be rewritten as F(w) = ⟨w, g⟩ + F(w). And the mixed gradient can be rewritten as ˜ k = g + ∇gt (wt ). g 227 (8.11) Then, the updating rule given in Algorithm 11 becomes wt+1 = arg 1 η⟨w − wt , g + ∇gt (wt )⟩ + ∥w − wt ∥2 . 2 ¯ w∈W∩∥w−w∥≤∆ min (8.12) For each iteration t in the current epoch, we have F(wt ) − F(w∗ ) (8.5) λ (8.13) ≤ ⟨∇F(wt ), wt − w∗ ⟩ − ∥wt − w∗ ∥2 2 ⟨ ⟩ λ (8.11) = ⟨g + ∇gt (wt ), wt − w∗ ⟩ + ∇F(wt ) − ∇gt (wt ), wt − w∗ − ∥wt − w∗ ∥2 , 2 and ⟨g + ∇gt (wt ), wt − w∗ ⟩ =⟨g + ∇gt (wt ), wt − w∗ ⟩ − ∥wt − w∗ ∥2 ∥wt − w∗ ∥2 + 2η 2η ∥wt − wt+1 ∥2 ∥wt+1 − w∗ ∥2 ∥wt − w∗ ∥2 − + 2η 2η 2η 2 2 ∥wt − w∗ ∥ ∥wt+1 − w∗ ∥ ≤⟨g, wt − wt+1 ⟩ + − 2η 2η ( ) ∥wt − w∥2 + max ⟨∇gt (wt ), wt − w⟩ − w 2η (8.12), (8.13) ≤ ⟨g + ∇gt (wt ), wt − wt+1 ⟩ − =⟨g, wt − wt+1 ⟩ + ∥wt − w∗ ∥2 ∥wt+1 − w∗ ∥2 η − + ∥∇gt (wt )∥2 . 2η 2η 2 (8.14) Combining (8.13) and (8.14), we have F(wt ) − F(w∗ ) ≤ ∥wt − w∗ ∥2 ∥wt+1 − w∗ ∥2 λ − − ∥wt − w∗ ∥2 2η 2η 2 ⟨ ⟩ η 2 + ⟨g, wt − wt+1 ⟩ + ∥∇gt (wt )∥ + ∇F(wt ) − ∇gt (wt ), wt − w∗ . 2 228 By adding the inequalities of all iterations, we have T ∑ F(wt ) − F(w∗ ) t=1 ¯ − w∗ ∥2 ∥wT +1 − w∗ ∥2 λ ∑ ∥w ¯ − wT +1 ⟩ ≤ − − ∥wt − w∗ ∥2 + ⟨g, w 2η 2η 2 T t=1 + (8.15) ∑ η∑ ∥∇gt (wt )∥2 + ⟨∇F(wt ) − ∇gt (wt ), wt − w∗ ⟩ . 2 T T t=1 t=1 ≜AT ≜BT Since F(·) is L-smooth, we have ¯ ≤ ⟨∇F (w), ¯ wT +1 − w⟩ ¯ + F(wT +1 ) − F(w) L ¯ − wT +1 ∥2 , ∥w 2 which implies ¯ − wT +1 ⟩ ⟨g, w L 2 ∆ 2 (8.7) λ L ≤ F(w∗ ) − F(wT +1 ) + ∆2 + ∆2 2 2 ¯ − F (wT +1 ) + ≤F(w) (8.16) ≤F(w∗ ) − F(wT +1 ) + L∆2 . From (8.15) and (8.16), we have T∑ +1 ( F(wt ) − F(w∗ ) ≤ ∆2 t=1 ) 1 η + L + AT + BT . 2η 2 (8.17) Next, we consider how to bound AT and BT . The upper bound of AT is given by AT = T ∑ t=1 ∥∇gt (wt )∥2 = T ∑ T (8.3) ∑ 2 2 ¯ ¯ 2 ≤ T L2 ∆2 . (8.18) ∥∇ft (wt )−∇ft (w)∥ ≤ L ∥wt − w∥ t=1 t=1 229 To bound BT , we need the Hoeffding-Azuma inequality which is stated in Theorem A.20 for completeness. Define Vt = ⟨∇F(wt ) − ∇gt (wt ), wt − w∗ ⟩, t = 1, . . . , T. Recall the definition of F(w) and gt (w) in (8.10). Based on our assumption about the function oracle Of , it is straightforward to check that V1 , . . . is a martingale difference with respect to g1 , . . .. The value of Vt can be bounded by |Vt | ≤ ≤ (8.3), (8.4) ≤ ∇F(wt ) − ∇gt (wt ) ∥wt − w∗ ∥ ¯ + ∥∇ft (wt ) − ∇ft (w)∥) ¯ 2∆ (∥∇F(wt ) − ∇F(w)∥ ¯ ≤ 4L∆2 . 4L∆∥wt − w∥ Following Theorem A.20, with a probability at least 1 − δ, we have √ BT ≤ 4L∆2 1 2T ln . δ (8.19) By adding the inequalities in (8.17), (8.18) and (8.19) together, with a probability at least 1 − δ, we have T∑ +1 t=1 ( F(wt ) − F (w∗ ) ≤ ∆2 ) √ ηT L2 1 1 +L+ + 4L 2T ln . 2η 2 δ 230 √ By choosing η = 1/[L T ], we have T∑ +1 t=1 ) ( √ √ √ 1 1 F(wt ) − F(w∗ ) ≤ L∆2 T + 1 + 4 2T log ≤ 6L∆2 2T ln . δ δ and therefore √ √ (A.2) 6L 2 ln 1/δ 12L 2 ln 1/δ √ F(w) − F(w∗ ) ≤ ∆2 √ , and ∥w − w∗ ∥2 ≤ ∆2 . T +1 λ T +1 Thus, when 1152L2 1 T ≥ ln , δ λ2 with a probability at least 1 − δ, we have F(w) − F(w∗ ) ≤ 8.4 λ 2 1 ∆ , and ∥w − w∗ ∥2 ≤ ∆2 . 4 2 Experiments In this section, we provide experimental evidence complementing our theoretical results. In particular we consider solving the regularized logistic regression problem formulated as: λ 1∑ fi (w) + ∥w∥2 w n 2 n min i=1 where −yi fi (w) = log (1 + exp(−yi ⟨w, xi ⟩)) , and ∇fi (w) = x. 1 + exp(yi ⟨w, xi ⟩) i 231 We compare the EMGD algorithm to the SGD method. 
SGD starts with the solution w_1 = 0 and, at each iteration, samples an index i_k uniformly at random over all n available functions and updates the solution by

    w_{t+1} = w_t − (1/(λt)) ( ∇f_{i_k}(w_t) + λw_t ) = (1 − 1/t) w_t − (1/(λt)) ∇f_{i_k}(w_t).

The variance of the SGD gradient at each iteration can be computed as

    E_{i_k} ‖ ∇f_{i_k}(w_t) + λw_t − ( (1/n) ∑_{i=1}^n ∇f_i(w_t) + λw_t ) ‖^2
      = E_{i_k} ‖ ∇f_{i_k}(w_t) − (1/n) ∑_{i=1}^n ∇f_i(w_t) ‖^2
      = (1/n) ∑_{i=1}^n ‖∇f_i(w_t)‖^2 − ‖ (1/n) ∑_{i=1}^n ∇f_i(w_t) ‖^2.

Noting that the gradient used by EMGD at each iteration is

    ( ∇f_{i_k}(w_t) + λw_t ) − ( ∇f_{i_k}(w̄) + λw̄ ) + ( (1/n) ∑_{i=1}^n ∇f_i(w̄) + λw̄ ),

the variance of the mixed gradient in EMGD is given by

    E_{i_k} ‖ ( ∇f_{i_k}(w_t) − ∇f_{i_k}(w̄) ) − (1/n) ∑_{i=1}^n ( ∇f_i(w_t) − ∇f_i(w̄) ) ‖^2
      = (1/n) ∑_{i=1}^n ‖∇f_i(w_t) − ∇f_i(w̄)‖^2 − ‖ (1/n) ∑_{i=1}^n ( ∇f_i(w_t) − ∇f_i(w̄) ) ‖^2.

We run both algorithms on two well-known data sets, Adult and RCV1, obtained from the University of California Irvine (UCI) Machine Learning Repository [11]. The Adult data set contains information on individuals, such as age, level of education, and current employment type; it contains over 30,000 records of census information taken in 1994 from many diverse demographics. RCV1 is an archive of over 800,000 manually categorized newswire stories recently made available by Reuters, Ltd. for research purposes. For EMGD, we set the number of stochastic gradients in each epoch to n, the number of training examples.

[Figure 8.2: Experimental results on the Adult data set (λ = 1e−3 and η = 0.01 in EMGD). Panel (a): the training loss minus the optimum versus the number of gradients; panel (b): the variance of the (stochastic or mixed) gradient versus the number of gradients, for EMGD and SGD.]

[Figure 8.3: Experimental results on the RCV1 data set (λ = 1e−5 and η = 1 in EMGD). Panel (a): the training error versus the number of gradients; panel (b): the variance of the (stochastic or mixed) gradient versus the number of gradients, for EMGD and SGD.]

[Figure 8.4: The testing accuracy of EMGD and SGD on the Adult data set (panel (a)) and the RCV1 data set (panel (b)).]

The results of both the SGD and EMGD algorithms on the Adult data set are provided in Figure 8.2. Figure 8.2a shows the difference between the current objective value and the optimum (which is obtained by running a batch algorithm for a long time) versus

    (the number of full gradients + the number of stochastic gradients) / n.

Figure 8.2b shows the variances of SGD and EMGD. It can be inferred from the results that the variance of SGD stays almost the same even when the algorithm approaches the optimal solution, whereas the EMGD method is able to reduce both the training error and the variance exponentially. We obtain similar results on the RCV1 data set, which are shown in Figure 8.3. We also examine the testing accuracy of SGD and EMGD, which is depicted in Figure 8.4. As can be seen, in terms of testing accuracy, EMGD is slightly worse than SGD at the beginning. This behavior is expected since EMGD needs one full gradient to initialize.
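For completeness, the following sketch contrasts the SGD update above with one epoch of EMGD (Algorithm 11), reusing the helpers grad_fi and full_grad from the previous sketch. It is an illustration under simplifying assumptions rather than the code behind the reported experiments: the domain is taken to be all of R^d, so the gradient mapping in step 9 reduces to a gradient step followed by clipping back into the ball {‖w − w̄‖ ≤ Δ}.

    import numpy as np
    # assumes grad_fi and full_grad from the previous sketch are in scope

    def sgd(X, y, lam, T, rng):
        # Plain SGD with step size 1/(lam * t), as described above.
        n, d = X.shape
        w = np.zeros(d)
        for t in range(1, T + 1):
            i = rng.integers(n)
            g = grad_fi(w, X[i], y[i]) + lam * w      # stochastic gradient of the regularized objective
            w = w - g / (lam * t)
        return w

    def emgd_epoch(w_bar, delta, eta, X, y, lam, T, rng):
        # One epoch of EMGD: mixed gradient steps inside the ball {||w - w_bar|| <= delta}.
        n, d = X.shape
        g_full = full_grad(w_bar, X, y, lam)          # one full gradient per epoch (step 5)
        w = w_bar.copy()
        iterates = [w.copy()]
        for _ in range(T):
            i = rng.integers(n)
            # mixed gradient: full gradient at w_bar plus a stochastic correction (step 8)
            g_mix = (g_full
                     + (grad_fi(w, X[i], y[i]) + lam * w)
                     - (grad_fi(w_bar, X[i], y[i]) + lam * w_bar))
            w = w - eta * g_mix
            # gradient mapping over the shrinking ball around w_bar (step 9, unconstrained case)
            diff = w - w_bar
            norm = np.linalg.norm(diff)
            if norm > delta:
                w = w_bar + diff * (delta / norm)
            iterates.append(w.copy())
        # average of T + 1 solutions and domain size reduced by sqrt(2) (step 11)
        return np.mean(iterates, axis=0), delta / np.sqrt(2.0)

Chaining emgd_epoch over m epochs, feeding the returned average back in as the next w̄ together with the shrunken radius, reproduces the outer loop of Algorithm 11.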
8.5 Discussion

Compared to the optimization algorithm that relies only on full gradients [121], the number of full gradients needed by EMGD is O(log(1/ϵ)) instead of O(√κ log(1/ϵ)). Compared to the optimization algorithms that rely only on stochastic gradients [81, 72, 125], EMGD is more efficient since it achieves a linear convergence rate. The proposed EMGD algorithm can also be applied to the special optimization problem considered in [128, 135], where F(w) = (1/n) ∑_{i=1}^n f_i(w). To make quantitative comparisons, let us assume the full gradient is n times more expensive to compute than the stochastic gradient. Table 8.5 lists the computational complexity of the algorithms that enjoy linear convergence. As can be seen, the computational complexity of EMGD is lower than that of Nesterov's algorithm [121] as long as the condition number κ ≤ n^{2/3}; the complexity of SAG [128] is lower than that of Nesterov's algorithm if κ ≤ n/8, and the complexity of SDCA [135] is lower if κ ≤ n^2.[1] The complexity of EMGD is of the same order as SAG and SDCA when κ ≤ n^{1/2}, but higher in other cases. Thus, in terms of computational cost, EMGD may not be the best one, but it has advantages in other aspects.

Table 8.5: The computational complexity for minimizing (1/n) ∑_{i=1}^n f_i(w)

    Nesterov's algorithm [121]      O(n√κ log(1/ϵ))
    EMGD                            O((n + κ^2) log(1/ϵ))
    SAG (n ≥ 8κ) [128]              O(n log(1/ϵ))
    SDCA [135]                      O((n + κ) log(1/ϵ))

1. Unlike SAG and SDCA, which only work for unconstrained optimization problems, the proposed algorithm works for both constrained and unconstrained optimization problems, provided the constrained problem in Step 9 can be solved efficiently.

2. Unlike SAG and SDCA, which require Ω(n) storage space, the proposed algorithm only requires Ω(d) storage, where d is the dimension of w.

3. The only step in Algorithm 11 that depends on n is step 5, which computes the gradient ∇F(w̄^k). By utilizing distributed computing, the running time of this step can be reduced to O(n/k), where k is the number of computers, and the convergence rate remains the same. For SAG and SDCA, it is unclear whether they can reduce the running time without affecting the convergence rate.

4. The linear convergence of SAG and SDCA only holds in expectation, whereas the linear convergence of EMGD holds with a high probability, which is much stronger.

[1] In learning problems, we usually face a regularized optimization problem min_{w∈W} (1/n) ∑_{i=1}^n ℓ(y_i; ⟨w, x_i⟩) + (τ/2)‖w‖^2, where ℓ(·; ·) is some fixed loss function. When the norm of the data is bounded, the smoothness parameter L can be treated as a constant. The strong convexity parameter λ is lower bounded by τ. As a result, as long as τ > Ω(n^{−2/3}), which is a reasonable scenario [149], we have κ < O(n^{2/3}), indicating that our proposed EMGD algorithm can be applied.

8.6 Summary

In this chapter, we considered how to reduce the number of full gradients needed for smooth and strongly convex optimization problems. Under the assumption that both the full gradient and the stochastic gradient are available, the EMGD algorithm, with the help of stochastic gradients, is able to reduce the number of full gradients needed from O(√κ log(1/ϵ)) to O(log(1/ϵ)). In the case that the objective function is of the form (8.1), i.e., a sum of n smooth functions, EMGD has a lower computational cost than the full gradient method [121] if the condition number κ ≤ n^{2/3}.
We validated our theoretical results on the convergence of the EMGD algorithm, and in particular its ability to reduce the variance during optimization, through classification experiments on two standard data sets. We note that although EMGD enjoys many nice properties, it is unclear whether it is the optimal algorithm when both kinds of gradients are available.

8.7 Bibliographic Notes

During the last three decades, there have been significant advances in convex optimization [119, 121, 34]. In this section, we provide a brief review of first order optimization methods.

We first discuss deterministic optimization, where the gradient of the objective function is available. For general convex and Lipschitz continuous optimization problems, the iteration complexity of gradient (subgradient) descent is O(1/ϵ^2), which is optimal up to constant factors [119]. When the objective function is convex and smooth, the optimal optimization scheme is the accelerated gradient descent developed by Nesterov, whose iteration complexity is O(√(L/ϵ)) [120, 123]. With slight modifications, the accelerated gradient descent algorithm can also be applied to optimize smooth and strongly convex objective functions, for which the iteration complexity is O(√κ log(1/ϵ)) and is in general not improvable [121, 124]. The objective of our work is to reduce the number of accesses to the full gradient by exploiting the availability of stochastic gradients.

In stochastic optimization, we have access to a stochastic gradient, which is an unbiased estimate of the full gradient [115]. Similar to the case of deterministic optimization, if the objective function is convex and Lipschitz continuous, stochastic gradient (subgradient) descent is the optimal algorithm and the iteration complexity is also O(1/ϵ^2) [119, 115]. When the objective function is strongly convex, the algorithms proposed in very recent works [81, 72, 125] achieve the optimal O(1/(λϵ)) iteration complexity [4]. Since the convergence rate of stochastic optimization is dominated by the randomness in the gradient [94, 62], smoothness usually does not lead to a faster convergence rate in stochastic optimization.

From the above discussion, we observe that the iteration complexity in stochastic optimization is polynomial in 1/ϵ, making it difficult to find high-precision solutions. However, when the objective function is strongly convex and can be written as a sum of a finite number of functions, i.e.,

    F(w) = (1/n) ∑_{i=1}^n f_i(w),

where each f_i(w) is smooth, the iteration complexity of some specific algorithms may exhibit a logarithmic dependence on 1/ϵ, i.e., a linear convergence rate. The first such algorithm is the stochastic average gradient (SAG) [128], whose iteration complexity is O(n log(1/ϵ)), provided n ≥ 8κ. The second one is the stochastic dual coordinate ascent (SDCA) [135], whose iteration complexity is O((n + κ) log(1/ϵ)).

Chapter 9

Efficient Optimization with Bounded Projections

In this chapter we aim at developing more efficient optimization methods by reducing the number of projection steps. An examination of the optimization methods for both online and stochastic convex optimization problems introduced in Chapter 2 reveals that most of them require projecting the updated solution at each iteration to ensure that the obtained solution stays within the feasible domain.
For complex domains (e.g., the positive semidefinite cone), the projection step can be computationally expensive, making first order optimization methods unattractive for large-scale optimization problems. The broad question addressed in this chapter of the thesis is the extent to which it is possible to reduce the number of projection steps in stochastic and online optimization algorithms.

In the stochastic setting, we address this limitation by developing novel stochastic optimization algorithms that do not need intermediate projections. Instead, only one projection at the last iteration is needed to obtain a feasible solution in the given domain. Our theoretical analysis shows that, with a high probability, the proposed algorithms achieve an O(1/√T) convergence rate for general convex functions and an O(ln T/T) rate for strongly convex functions, under mild conditions on the domain and the objective function. The key insight underlying the proposed method for strongly convex objectives is that by smoothing the objective function and leveraging the smoothed objective in the optimization, we are able to skip the intermediate projections. This is in contrast to other parts of the thesis, where we explicitly leveraged the smoothness assumption. To the best of our knowledge, these are the first projection-free stochastic optimization methods.

In the online setting, to tackle the computational challenge arising from the projection steps, we consider an alternative online learning problem. Instead of requiring that each solution obeys the constraints which define the convex domain, we only require the constraints to be satisfied in the long run. The online learning problem then becomes the task of finding a sequence of solutions under long term constraints. In other words, instead of solving the projection on each round, we allow the learner to make decisions at some rounds which do not belong to the constraint set, but the overall sequence of chosen decisions must obey the constraints at the end at a vanishing rate. We refer to this problem as online learning with soft constraints. By turning the problem into an online convex-concave optimization problem, we propose an efficient algorithm which achieves an O(√T) regret bound and an O(T^{3/4}) bound on the violation of the constraints. We then modify the algorithm to guarantee that the constraints are exactly satisfied in the long run; this gain is achieved at the price of an O(T^{3/4}) regret bound.

9.1 Setup and Motivation

In the stochastic setting, we consider the following convex optimization problem:

    min_{w∈W} f(w),   (9.1)

where W is a bounded convex domain. We assume that W can be characterized by an inequality constraint and, without loss of generality, is contained in the unit ball, i.e.,

    W = {w ∈ R^d : g(w) ≤ 0} ⊆ B = {w ∈ R^d : ‖w‖_2 ≤ 1},   (9.2)

where g(w) is a convex constraint function. We assume that W has a non-empty interior, i.e., there exists w such that g(w) < 0, and that the optimal solution w_* to (9.1) lies in the interior of the unit ball B, i.e., ‖w_*‖_2 < 1. Note that when a domain is characterized by multiple convex constraint functions, say g_i(w) ≤ 0, i = 1, . . . , m, we can summarize them into the single constraint g(w) ≤ 0 by defining g(w) = max_{1≤i≤m} g_i(w).

To solve the optimization problem in (9.1), we assume that the only information available to the algorithm is through a stochastic oracle that provides unbiased estimates of the gradient of f(w). More precisely, let
ξ_1, . . . , ξ_T be a sequence of independent and identically distributed (i.i.d.) random variables sampled from an unknown distribution. At each iteration t, given the solution w_t, the oracle returns g(w_t; ξ_t), an unbiased estimate of the true gradient ∇f(w_t), i.e., E_{ξ_t}[g(w_t, ξ_t)] = ∇f(w_t). The goal of the learner is to find an approximately optimal solution by making T calls to this oracle.

Recall from Chapter 2 that, to find a solution within the domain W which optimizes the given objective function f(w), SGD computes an unbiased estimate of the gradient of f(w) and updates the solution by moving it in the opposite direction of the estimated gradient. To ensure that the solution stays within the domain W, SGD has to project the updated solution back into W at every iteration. More precisely, the SGD method produces a sequence of solutions by the following updating:

    w_{t+1} = Π_W(w_t − η_t g(w_t, ξ_t)),   (9.3)

where η_t is the step size at iteration t, Π_W(·) is a projection operator that projects w into the domain W, and g(w, ξ_t) is an unbiased stochastic gradient of f(w), for which we further assume a bounded gradient variance as

    E_{ξ_t}[exp(‖g(w, ξ_t) − ∇f(w)‖_2^2 / σ^2)] ≤ exp(1).   (9.4)

For general convex optimization, stochastic gradient descent methods can obtain an O(1/√T) convergence rate in expectation or with a high probability, provided (9.4) holds [115]. Although a large number of iterations is usually needed to obtain a solution of desirable accuracy, the lightweight computation per iteration makes SGD attractive for many large-scale learning problems.

The SGD method is computationally efficient only when the projection Π_W(·) can be carried out efficiently. Although efficient algorithms have been developed for projecting solutions onto special domains (e.g., the simplex and the ℓ_1 ball [51, 100]), for complex domains, such as the positive semidefinite (PSD) cone in metric learning and bounded trace norm matrices in matrix completion (more examples of complex domains can be found in [73] and [77]), the projection step requires solving an expensive convex optimization problem, leading to a high computational cost per iteration and consequently making SGD unappealing for large-scale optimization problems over such domains. For instance, projecting a matrix onto the PSD cone requires computing the full eigen-decomposition of the matrix, whose complexity is cubic in the size of the matrix.

The central theme of this chapter, in the stochastic setting, is to develop an SGD based method that does not require a projection at each iteration. This problem was first addressed in a very recent work [73], where the authors extended the Frank-Wolfe algorithm [56] to online learning. However, one main shortcoming of the algorithm proposed in [73] is that it has a slower convergence rate (i.e., O(T^{−1/3})) than the standard SGD algorithm (i.e., O(T^{−1/2})). In this work, we demonstrate that a properly modified SGD algorithm can achieve the optimal convergence rate of O(T^{−1/2}) using only ONE projection for general stochastic convex optimization problems. We further develop an SGD based algorithm for strongly convex optimization that achieves a convergence rate of O(ln T/T), which is only a logarithmic factor worse than the optimal rate [72]. The key idea of both algorithms is to appropriately penalize the intermediate solutions when they are outside the domain.
With an appropriate design of the penalization mechanism, the average solution w̄_T obtained by SGD after T iterations will be very close to the domain W, even without intermediate projections. As a result, the final feasible solution w̃_T can be obtained by projecting w̄_T onto the domain W, the only projection that is needed for the entire algorithm. We note that our approach is very different from previous efforts in developing projection-free convex optimization algorithms (see [73, 78, 77] and references therein), where the key idea is to develop appropriate updating procedures to restore the feasibility of solutions at every iteration.

9.2 Stochastic Frank-Wolfe Algorithm

Before presenting the proposed algorithms, we investigate and analyze a greedy algorithm, a slight modification of the Frank-Wolfe (FW) method, when applied to the stochastic optimization problem. As discussed in Chapter 2, the FW or conditional gradient algorithm is a feasible direction method: at each iteration, it finds the best feasible direction (with respect to the linear approximation of the function) and updates the next solution as a convex combination of the current solution and the chosen feasible direction. This algorithm was revisited in [77] for deterministic smooth convex optimization over a convex domain, aiming to devise an efficient algorithm without projections. As discussed before, the algorithm in [77] generates a sequence of solutions via the following steps:

    u_t = arg max_{u∈W} ⟨u, −∇f(w_t)⟩,
    w_{t+1} = (1 − η_t) w_t + η_t u_t.   (9.5)

The bulk of the computation in the FW algorithm is step (9.5), and the algorithm is attractive whenever step (9.5) can be carried out at low computational cost, for otherwise it is not practically interesting. This is true for some special domains, such as the polyhedron, the simplex, and the ℓ_1 ball, where the linear optimization problem can be solved efficiently. However, this assumption does not hold for general complex domains (e.g., general PSD cones). This limitation makes the FW algorithm computationally unattractive for such domains, since it simply translates the complexity of the quadratic optimization required for projection into a linear optimization problem over the same domain. We note that the FW algorithm is also attractive because of its sparsity merits, which are not the focus of this chapter; here we only consider the computational efficiency of optimization methods.

A trivial modification to extend the FW algorithm to stochastic gradients is to replace the true gradient ∇f(w_t) with a stochastic gradient g(w_t, ξ_t) in (9.5). Next, we sketch an analysis of this stochastic greedy algorithm and discuss the problem caused by using an unbiased estimate of the gradient, instead of the true gradient, in updating the solutions. A key inequality in the convergence analysis of the greedy algorithm is

    f(w_{t+1}) ≤ f(w_t) − η_t max_{w∈W} ⟨w_t − w, ∇f(w_t)⟩ + η_t^2 L
              = f(w_t) − η_t g(w_t, ∇f(w_t)) + η_t^2 L,   (9.6)

where f(w) is assumed to be an L-smooth function and g(w, ∇f(w)) = max_{u∈W} ⟨w − u, ∇f(w)⟩ is the duality gap. Using g(w_t, ∇f(w_t)) ≥ f(w_t) − f(w_*), we can obtain an O(1/T) convergence bound for the greedy algorithm with η_t = 2/(t+1) by induction (e.g., see Theorem 2.21 in Chapter 2).
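To make the discussion concrete, the sketch below instantiates the deterministic update (9.5) for the case where W is an ℓ_1 ball, one of the special domains for which the linear step has a closed form; the function names and the choice of domain are ours, purely for illustration. Replacing grad_f(w) with a stochastic estimate yields the trivial stochastic modification discussed next.

    import numpy as np

    def fw_step_l1(grad, radius):
        # argmin_{||u||_1 <= radius} <u, grad>: put all mass on the coordinate of largest |grad|.
        u = np.zeros_like(grad)
        j = np.argmax(np.abs(grad))
        u[j] = -radius * np.sign(grad[j])
        return u

    def frank_wolfe(grad_f, w0, radius, T):
        # Deterministic Frank-Wolfe / conditional gradient with eta_t = 2/(t+1).
        # w0 is assumed to be feasible, i.e. ||w0||_1 <= radius.
        w = w0.copy()
        for t in range(1, T + 1):
            u = fw_step_l1(grad_f(w), radius)
            eta = 2.0 / (t + 1.0)
            w = (1.0 - eta) * w + eta * u   # convex combination keeps w feasible
        return w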
However, when using the stochastic gradient, ut = arg maxu∈W ⟨u, −g(wt , ξt )⟩, and the key inequality becomes as: f (wt+1 ) ≤ f (wt ) − ηt ⟨wt − ut , ∇f (wt )⟩ + ηt2 L ≤ f (wt ) − ηt g(wt , ∇f (wt )) + ηt2 L + ηt (g(wt , ∇f (wt )) − g(wt , g(wt , ξt ))) + ηt ⟨wt − ut , g(wt , ξt ) − ∇f (wt )⟩ ≜ ζt where the top line recovers the inequality in (9.6). However, the problem is that the quantity ζt in the second line of above inequality is not a martingale sequence, i.e., Eξ |ξ [ζ ] ̸= 0. t [t−1] t We can take a conservative analysis to bound ζt ≤ Cηt ∥∇f (wt )−g(wt , ξt )∥∗ ≜ Cηt δt , where C is a constant. Assuming ∥∇f (wt ) − g(wt , ξt )∥∗ ≤ σ, we could have, with a probably 1 − ϵ, ηt δt ≤ ηt (1 + ln(1/ϵ))σ by Markov inequality in Lemma A.18. As a result, we obtain the 245 following recursive inequality f (wt+1 ) − f (w∗ ) ≤ (1 − ηt )(f (wt ) − f (w∗ )) + ηt2 L + ηt C(1 + ln(1/ϵ))σ where the last term in the above inequality makes the convergence analysis more involved. Whether it is possible to design a sequence of step sizes ηt to have a vanishing bound for f (wt ) − f (w∗ ) is unclear for us and remains an open problem, which is beyond the scope of this chapter. 9.3 Stochastic Optimization with Single Projection We now turn to extending the SGD method to the setting where only one projection is allowed to perform for the entire sequence of updating. The main idea is to incorporate the constraint function g(w) into the objective function to penalize the intermediate solutions that are outside the domain. The result of the penalization is that, although the average solution obtained by SGD may not be feasible, it should be very close to the boundary of the domain. A projection is performed at the end of the iterations to restore the feasibility of the average solution. Before proceeding, we recall few definitions from Appendix ?? about convex analysis. Definition 9.1. A function f (w) is a Lipschitz continuous function with constant G w.r.t a norm ∥ · ∥, if |f (w1 ) − f (w2 )| ≤ G∥w1 − w2 ∥, ∀w1 , w2 ∈ B. (9.7) In particular, a convex function f (w) with a bounded (sub)gradient ∥∂f (w)∥∗ ≤ G is 246 Algorithm 12 SGD with ONE Projection by Primal Dual Updating (SGD-PD) 1: Input: a sequence of step sizes {ηt }, and a parameter γ > 0 2: Initialize:: w1 = 0 and λ1 = 0 3: for t = 1, 2, . . . , T do ′ 4: Compute wt+1 = wt − ηt (g(wt , ξt ) + λt ∇g(wt )) ′ / max (∥w′ ∥ , 1), 5: Update wt+1 = wt+1 t+1 2 6: Update λt+1 = [(1 − γηt )λt + ηt g(wt )]+ 7: end for ∑T 8: Output: wT = ΠW (wT ), where wT = t=1 wt /T . G-Lipschitz continuous, where ∥ · ∥∗ is the dual norm to ∥ · ∥. Definition 9.2. A convex function f (w) is β-strongly convex w.r.t a norm ∥·∥ if there exists a constant β > 0 (often called the modulus of strong convexity) such that, for any α ∈ [0, 1], it holds: 1 f (αw1 + (1 − α)w2 ) ≤ αf (w1 ) + (1 − α)f (w2 ) − α(1 − α)β∥w1 − w2 ∥2 , ∀w1 , w2 ∈ B. 2 When f (w) is differentiable, the strong convexity is equivalent to f (w1 ) ≥ f (w2 ) + ⟨∇f (w2 ), w1 − w2 ⟩ + β ∥w − w2 ∥2 , ∀w1 , w2 ∈ B. 2 1 In the sequel, we use the standard Euclidean norm to define Lipschitz and strongly convex functions. The key ingredient of proposed algorithms is to replace the projection step with the gradient computation of the constraint function defining the domain W, which is significantly cheaper than projection step. 
As an example, when a solution is restricted to a PSD cone, i.e., X ⪰ 0 where X is a symmetric matrix, the corresponding inequality constraint is g(X) = λmax (−X) ≤ 0, where λmax (X) computes the largest eigenvalue of X and is a 247 convex function. In this case, ∇g(X) only requires computing the minimum eigenvector of a matrix, which is cheaper than a full eigenspectrum computation required at each iteration of the standard SGD algorithm to restore feasibility. Below, we state a few assumptions about f (w) and g(w) often made in stochastic optimization as: Assumption 9.3. We assume that: A1 ∥∇f (w)∥2 ≤ G1 , ∥∇g(w)∥2 ≤ G2 , |g(w)| ≤ C2 , A2 Eξt [exp(∥g(w, ξt ) − ∇f (w)∥22 /σ 2 )] ≤ exp(1), ∀w ∈ B, (9.8) ∀w ∈ B. We also make the following mild assumption about the boundary of the convex domain W as: Assumption 9.4. We assume that: A3 there exists a constant ρ > 0 such that min ∥∇g(w)∥2 ≥ ρ. (9.9) g(w)=0 Remark 9.5. The purpose of introducing assumption A3 is to ensure that the optimal dual variable for the constrained optimization problem in (9.1) is well bounded from the above, a key factor for our analysis. To see this, we write the problem in (9.1) into a convex-concave optimization problem: min max f (w) + λg(w). w∈B λ≥0 Let (w∗ , λ∗ ) be the optimal solution to the above convex-concave optimization problem. Since we assume g(w) is strictly feasible, w∗ is also an optimal solution to (9.1) due to the strong duality theorem [33]. Using the first order optimality condition, we have ∇f (w∗ ) = 248 −λ∗ ∇g(w∗ ). Hence, λ∗ = 0 when g(w∗ ) < 0, and λ∗ = ∥∇f (w∗ )∥2 /∥∇g(w∗ )∥2 when g(w∗ ) = 0. Under assumption A3, we have λ∗ ∈ [0, G1 /ρ]. We note that, from a practical point of view, it is straightforward to verify that for many domains including PSD cone and Polytope, the gradient of the constraint function is lower bounded on the boundary and therefore assumption A3 does not limit the applicability of the proposed algorithms for stochastic optimization. For the example of g(X) = λmax (−X), the assumption A3 implies ming(X)=0 ∥∇g(X)∥F = ∥uu⊤ ∥F = 1, where u is an orthonomal vector representing the corresponding eigenvector of the matrix X whose minimum eigenvalue is zero. We propose two different ways of incorporating the constraint function into the objective function, which result in two algorithms, one for general convex and the other for strongly convex functions. 9.3.1 General Convex Functions To incorporate the constraint function g(w), we introduce a regularized Lagrangian function, γ L(w, λ) = f (w) + λg(w) − λ2 , 2 λ ≥ 0. The summation of the first two terms in L(w, λ) corresponds to the Lagrangian function in dual analysis and λ corresponds to a Lagrangian multiplier. A regularization term −(γ/2)λ2 is introduced in L(w, λ) to prevent λ from being too large. Instead of solving the constrained optimization problem in (9.1), we try to solve the following convex-concave optimization 249 problem min max L(w, λ). w∈B λ≥0 (9.10) The proposed algorithm for stochastically optimizing the problem in (9.10) is summarized in Algorithm 12. It differs from the existing stochastic gradient descent methods in that it updates both the primal variable w (steps 4 and 5) and the dual variable λ (step 6), which shares the same step sizes. We note that the parameter ρ is not employed in the implementation of Algorithm 12 and is only required for the theoretical analysis. It is noticeable that a similar primal-dual updating will be explored later in this chapter to avoid projection in online learning. 
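As a concrete rendering of these updates, the sketch below performs the primal-dual iterations of Algorithm 12 with the single projection deferred to the end. The callables stoch_grad, g, grad_g, and project_W are placeholders to be supplied by the user; for the PSD example above, g(X) = λ_max(−X) = −λ_min(X), its (sub)gradient is the negated outer product of an eigenvector associated with the smallest eigenvalue, and project_W is the expensive eigen-decomposition based projection, now invoked exactly once. This is an illustrative sketch rather than a reference implementation.

    import numpy as np

    def sgd_one_projection(stoch_grad, g, grad_g, project_W, w0, T, eta, gamma, rng):
        # Sketch of Algorithm 12 (SGD-PD): primal-dual updates inside the unit ball,
        # with a single projection onto W at the very end.
        w, lam = w0.copy(), 0.0
        w_sum = np.zeros_like(w0)
        for t in range(T):
            w_sum += w                                               # running sum of w_1, ..., w_T
            grad_w = stoch_grad(w, rng) + lam * grad_g(w)            # step 4: primal gradient of L(w, lam)
            lam = max((1.0 - gamma * eta) * lam + eta * g(w), 0.0)   # step 6: dual update with shrinkage
            w = w - eta * grad_w
            w = w / max(np.linalg.norm(w), 1.0)                      # step 5: stay inside the unit ball B
        w_bar = w_sum / T
        return project_W(w_bar)                                      # the only projection onto W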
We note that in online setting the algorithm and analysis only lead to a bound for the regret and the violation of the constraints in a long run, which does not necessarily guarantee the feasibility of final solution. Also our proof techniques differ from [115], where the convergence rate is obtained for the saddle point; however our goal is to attain bound on the convergence of the primal feasible solution. Remark 9.6. The convex-concave optimization problem in (9.10) is equivalent to the following minimization problem: min f (w) + w∈B [g(w)]2+ , 2γ (9.11) where [z]+ outputs z if z > 0 and zero otherwise. It thus may seem tempting to directly optimize the penalized function f (w) + [g(w)]2+ /(2γ) using the standard SGD method, which √ unfortunately does not yield a regret of O( T ). This is because, in order to obtain a regret √ √ of O( T ), we need to set γ = Ω( T ), which unfortunately will lead to a blowup of the 250 gradients and consequently a poor regret bound. Using a primal-dual updating schema allows √ us to adjust the penalization term more carefully to obtain an O(1/ T ) convergence rate. Theorem 9.7. For any general convex function f (w), if we set ηt = γ/(2G22 ), t = 1, · · · , T , √ 2 and γ = G2 / (G21 + C22 + (1 + ln(2/δ))σ 2 )T in Algorithm 12, under assumptions A1-A3, we have, with a probability at least 1 − δ, ( f (wT ) ≤ min f (w) + O w∈W 1 √ T ) , where O(·) suppresses polynomial factors that depend on ln(2/δ), G1 , G2 , C2 , ρ, and σ. 9.3.2 Strongly Convex Functions with Smoothing We first emphasize that it is difficult to extend Algorithm 12 to achieve an O(ln T /T ) convergence rate for strongly convex optimization. This is because although the function −L(w, λ) is strongly convex in λ, its modulus for strong convexity is γ, which is too small to obtain an O(ln T ) regret bound. To achieve a faster convergence rate for strongly convex optimization, we change assumptions A1 and A2 as follows. Assumption 9.8. A4 ∥g(w, ξt )∥2 ≤ G1 , ∥∇g(w)∥2 ≤ G2 , ∀w ∈ B, where we slightly abuse the same notation G1 . Note that A1 only requires that ∥∇f (w)∥2 is bounded and A2 assumes a mild condition 251 on the stochastic gradient. In contrast, for strongly convex optimization we need to assume a bound on the stochastic gradient ∥g(w, ξt )∥2 . Although assumption A4 is stronger than assumptions A1 and A2, however, it is always possible to bound the stochastic gradient for machine learning problems where f (w) usually consists of a summation of loss functions on training examples, and the stochastic gradient is computed by sampling over the training examples. Given the bound on ∥g(w, ξt )∥2 , we can easily have ∥∇f (w)∥2 = ∥Eg(w, ξt )∥2 ≤ E∥g(w, ξt )∥2 ≤ G1 , which is used to set an input parameter λ0 > G1 /ρ to the algorithm. According to the discussion in the last subsection, we know that the optimal dual variable λ∗ is upper bounded by G1 /ρ, and consequently is upper bounded by λ0 . Similar to the last approach, we write the optimization problem (9.1) into an equivalent convex-concave optimization problem: min f (w) = min max f (w) + λg(w) = min f (w) + λ0 [g(w)]+ . 
g(w)≤0 w∈B 0≤λ≤λ0 w∈B To avoid unnecessary complication due to the subgradient of [·]+ , following [123], we introduce a smoothing term H(λ/λ0 ), where H(p) = −p ln p − (1 − p) ln(1 − p) is the entropy function, into the Lagrangian function, leading to the optimization problem min F (w), where w∈B F (w) is defined as ( ( )) λ0 g(w) , F (w) = f (w) + max λg(w) + γH(λ/λ0 ) = f (w) + γ ln 1 + exp γ 0≤λ≤λ0 where γ > 0 is a parameter whose value will be determined later. Given the smoothed objective function F (w), we find the optimal solution by applying SGD to minimize F (w), 252 Algorithm 13 SGD with ONE Projection by a Smoothing Technique (SGD-ST) 1: Input: a sequence of step sizes {ηt }, λ0 , and γ 2: Initialize: w1 = 0. 3: for t = 1, . . . , T do 4: Compute ′ wt+1 = wt − ηt ( g(wt , ξt ) + exp (λ0 g(wt )/γ) λ ∇g(wt ) 1 + exp(λ0 g(wt )/γ) 0 ) ′ / max(∥w′ ∥ , 1) 5: Update wt+1 = wt+1 t+1 2 6: end for ∑T 7: Return: wT = ΠW (wT ), where wT = t=1 wt /T . where the gradient of F (w) is computed by ∇F (w) = ∇f (w) + exp (λ0 g(w)/γ) λ ∇g(w). 1 + exp (λ0 g(w)/γ) 0 (9.12) Algorithm 13 gives the detailed steps. Unlike Algorithm 12, only the primal variable w is updated in each iteration using the stochastic gradient computed in (9.12). The following theorem shows that Algorithm 13 achieves an O(ln T /T ) convergence rate if the cost functions are strongly convex. Theorem 9.9. For any β-strongly convex function f (w), if we set ηt = 1/(2βt), t = 1, . . . , T , γ = ln T /T , and λ0 > G1 /ρ in Algorithm 13, under assumptions A3 and A4, we have with a probability at least 1 − δ, ( f (wT ) ≤ min f (w) + O w∈W ln T T ) , where O(·) suppresses polynomial factors that depend on ln(1/δ), 1/β, G1 , G2 , ρ, and λ0 . It is well known that the optimal convergence rate of SGD for strongly convex optimization is O(1/T ) [72] which has been proven to be tight in stochastic optimization setting [4]. 253 According to Theorem 9.9, Algorithm 13 achieves an almost optimal convergence rate except for the factor of ln T . It is worth mentioning that although it is not explicitly given in Theorem 9.9, the detailed expression for the convergence rate of Algorithm 13 exhibits a tradeoff in setting λ0 (more can be found in the proof of Theorem 9.9). Finally, under √ assumptions A1-A3, Algorithm 13 can achieve an O(1/ T ) convergence rate for general convex functions, similar to Algorithm 12. 9.4 Analysis of Convergence Rate We here present the proofs of main theorems. The omitted proofs of some results are deferred to Section 9.6. 9.4.1 Convergence Rate for General Convex Functions To pave the path for the proof of of Theorem 9.7, we present a series of lemmas. The lemma below states two key inequalities for the proof, which follows the standard analysis of gradient descent. Lemma 9.10. Under the bounded assumptions in (9.8) and (9.8), for any w ∈ B and λ > 0, we have ⟨wt − w, ∇w L(wt , λt )⟩ ≤ ) 1 ( ∥w − wt ∥22 − ∥w − wt+1 ∥22 + 2ηt G21 + ηt G22 λ2t 2ηt + 2ηt ∥g(wt , ξt ) − ∇f (wt )∥22 + ⟨w − wt , g(wt , ξt ) − ∇f (wt )⟩, ≡∆t ) 1 ( 2 2 (λ − λt )∇λ L(wt , λt ) ≤ |λ − λt | − |λ − λt+1 | + 2ηt C22 . 2ηt 254 ≡ζt (w) An immediate result of Lemma 9.10 is the following which states a regret-type bound. Lemma 9.11. For any general convex function f (w), if we set ηt = γ/(2G22 ), t = 1, · · · , T , we have T ∑ (f (wt ) − f (w∗ )) + t=1 T T ∑ G22 (G21 + C22 ) γ ∑ ≤ + γT + 2 ∆t + ζt (w∗ ), γ G22 G2 t=1 t=1 ∑ [ Tt=1 g(wt )]2+ 2(γT + 2G22 /γ) (9.13) where w∗ = arg minw∈W f (w). Proof. 
(Proof of Theorem 9.7) First, by martingale inequality (e.g., Lemma 4 in [94]), with √ √ ∑ a probability 1 − δ/2, we have Tt=1 ζt (w∗ ) ≤ 2σ 3 ln(2/δ) T . By Markov’s inequality (Lemma A.18 in Appendix A.2), with a probability 1 − δ/2, we have ∑T t=1 ∆t ≤ (1 + ln(2/δ))σ 2 T . Substituting these inequalities into Lemma 9.11, plugging the stated value of γ, and using O(·) notation for ease of exposition, we have with a probability 1 − δ T ∑ T √ ]2 1 [∑ (f (wt ) − f (w∗ )) + √ g(wt ) + ≤ O( T ), C T t=1 t=1 √ √ 2 2 2 where C = 2G2 (1/ G1 + C2 + (1 + ln(2/δ))σ + 2 G21 + C22 + (1 + ln(2/δ))σ 2 ) and O(·) suppresses polynomial factors that depend on ln(2/δ), G1 , G2 , C2 , σ. Recalling the definition of wT = ∑T t=1 wt /T and using the convexity of f (w) and g(w), we have √ ( ) 1 T 2 f (wT ) − f (w∗ ) + [g(wT )]+ ≤ O √ . C T 255 (9.14) Assume g(wT ) > 0, otherwise wT = wT and we easily have f (wT ) ≤ minw∈W f (w) + √ O(1/ T ). Since wT is the projection of wT into W, i.e., wT = arg ming(w)≤0 ∥w − wT ∥22 , then by first order optimality condition, there exists a positive constant s > 0 such that g(wT ) = 0, and wT − wT = s∇g(wT ) ˜ T ). Hence, which indicates that wT − wT is in the same direction to ∇g(w g(wT ) = g(wT ) − g(wT ) ≥ (wT − wT )⊤ ∇g(wT ) = ∥wT − wT ∥2 ∥∇g(wT )∥2 (9.15) ≥ ρ∥wT − wT ∥2 , where the last inequality follows the definition of ming(w)=0 ∥∇g(w)∥2 ≥ ρ. Additionally, we have f (w∗ ) − f (wT ) ≤ f (w∗ ) − f (wT ) + f (wT ) − f (wT ) ≤ G1 ∥wT − wT ∥2 , (9.16) due to f (w∗ ) ≤ f (wT ) and Lipschitz continuity of f (w). Combining inequalities (9.14), (9.15), and (9.16) yields √ ρ2 √ T ∥wT − wT ∥22 ≤ O(1/ T ) + G1 ∥wT − wT ∥2 . C G C By simple algebra, we have ∥wT − wT ∥2 ≤ 21√ + O ρ T 256 (√ ) C ρ2 T Therefore f (wT ) ≤ f (wT ) − f (wT ) + f (wT ) ≤ G1 ∥wT − wT ∥2 + f (w∗ ) + O ( ) 1 ≤ f (w∗ ) + O √ , T ( 1 √ T ) (9.17) where we use the inequality in (9.14) to bound f (wT ) by f (w∗ ) and absorb the dependence on ρ, G1 , C into the O(·) notation. Remark 9.12. From the proof of Theorem 9.7, we can see that the key inequalities are (9.14), (9.15), and (9.16). In particular, the regret-type bound in (9.14) depends on the algorithm. If we only update the primal variable using the penalized objective in (9.11), whose gradient depends on 1/γ, it will cause a blowup in the regret bound with (1/γ + γT + T /γ), which leads to a non-convergent bound. 9.4.2 Convergence Rate for Strongly Convex Functions Our proof of Theorem 9.9 for the convergence rate of Algorithm 13 when applied to strongly convex functions starts with the following lemma by analogy of Lemma 9.11. Lemma 9.13. For any β-strongly convex function f (w), if we set ηt = 1/(2βt), we have T T T ∑ (G21 + λ20 G22 )(1 + ln T ) ∑ β∑ + ζt (w∗ ) − ∥w∗ − wt ∥22 (F (w) − F (w∗ )) ≤ 2β 4 t=1 t=1 t=1 where w∗ = arg minw∈W f (w). In order to prove Theorem 9.9, we need the following result for an improved martingale inequality. 257 Lemma 9.14. For any fixed w ∈ B, define DT = ∑T ∑T 2 t=1 ∥wt − w∥2 , ΛT = t=1 ζt (w), and m = ⌈log2 T ⌉. We have √ ( ) ( ) 4 m m Pr DT ≤ + Pr ΛT ≤ 4G1 DT ln + 4G1 ln ≥ 1 − δ. T δ δ Proof of Theorem 9.9. We substitute the bound in Lemma 9.14 into the inequality in Lemma 9.13 with w = w∗ . We consider two cases. In the first case, we assume DT ≤ 4/T . As a result, we have T ∑ ζt (w∗ ) = t=1 T ∑ (∇f (wt ) − g(wt , ξt ))⊤ (w∗ − wt ) ≤ 2G1 √ T DT ≤ 4G1 , t=1 which together with the inequality in Lemma 9.13 leads to the bound T ∑ (F (wt ) − F (w∗ )) ≤ 4G1 + t=1 (G21 + λ20 G22 )(1 + ln T ) . 
2β In the second case, we assume T ∑ √ ζt (w∗ ) ≤ 4G1 t=1 m m β DT ln + 4G1 ln ≤ DT + δ δ 4 ( ) 16G21 m + 4G1 ln , β δ √ where the last step uses the fact 2 ab ≤ a2 + b2 . We thus have T ∑ t=1 ( (F (wt ) − F (w∗ )) ≤ 16G21 + 4G1 β ) ln m (G21 + λ20 G22 )(1 + ln T ) + . δ 2β Combing the results of the two cases, we have, with a probability 1 − δ, 258 T ∑ t=1 ( (F (wt ) − F (w∗ )) ≤ 16G21 + 4G1 β ) ln (G2 + λ20 G22 )(1 + ln T ) m . + 4G1 + 1 δ 2β (9.18) O(ln T ) By convexity of F (w), we have F (wT ) ≤ F (w∗ ) + O (ln T /T ). Noting that w∗ ∈ W, g(w∗ ) ≤ 0, we have F (w∗ ) ≤ f (w∗ ) + γ ln 2. On the other hand, ( ( )) λ0 g(wT ) F (wT ) = f (wT ) + γ ln 1 + exp ≥ f (wT ) + max (0, λ0 g(wT )) . γ Therefore, with the value of γ = ln T /T , we have ( f (wT ) ≤ f (w∗ ) + O ln T T ) , f (wT ) + λ0 g(wT ) ≤ f (w∗ ) + O ( ln T T ) (9.19) . Applying the inequalities (9.15) and (9.16) to (9.19), and noting that γ = ln T /T , we have ( λ0 ρ∥wT − wT ∥2 ≤ G1 ∥wT − wT ∥2 + O ln T T ) . For λ0 > G1 /ρ, we have ∥wT − wT ∥2 ≤ (1/(λ0 ρ − G1 ))O(ln T /T ). Therefore ( f (wT ) ≤ f (wT )−f (wT )+f (wT ) ≤ G1 ∥wT −wT ∥2 +f (w∗ )+O where in the second inequality we use inequality (9.19). 259 ln T T ) ( ≤ f (w∗ )+O ln T T ) , 9.5 Online Optimization with Soft Constraints We now turn to making online learning algorithms more efficient by excluding the projection steps. To this end, we consider an alternative online learning problem in which, instead of requiring that each solution obeys the constraints, we only require the constraints, which define the convex domain W, to be satisfied in a long run. Then, the online learning problem becomes a task to find a sequence of solutions under the long term constraints. We present and analyze an online gradient descent based algorithm for online convex optimization with soft constraints. To facilitate our analysis, similar to the stochastic setting described in Section 9.1, we assume that the domain W can be written as an intersection of a finite number of convex constraints, that is, W = {w ∈ Rd : gi (w) ≤ 0, i ∈ [m]}, where gi (w), i ∈ [m], are Lipschitz continuous functions. We assume that W is a bounded domain, that is, there exist constants R > 0 and r < 1 such that W ⊆ RB and rB ⊆ W where B denotes the unit ℓ2 ball centered at the origin. We focus on the problem of online convex optimization as introduced in Chapter 2, in which the goal is to achieve a low regret with respect to a fixed decision on a sequence of adversarially chosen cost functions. The difference between the setting considered here and the general online convex optimization is that, in our setting, instead of requiring wt ∈ W, or equivalently gi (wt ) ≤ 0, i ∈ [m], for all t ∈ [T ], we only require the constraints to be satisfied in the long run, namely ∑T t=1 gi (wt ) ≤ 0, i ∈ [m]. Then, the problem becomes to find a sequence of solutions wt , t ∈ [T ] that minimizes the regret, under the long term constraints ∑T t=1 gi (wt ) ≤ 0, i ∈ [m]. Formally, we would like to solve the following optimization 260 problem online, min T ∑ w1 ,...,wT ∈B t=1 ft (wt ) − min w∈W T ∑ ft (w) s.t. t=1 T ∑ gi (wt ) ≤ 0 , i ∈ [m]. (9.20) t=1 For simplicity, we will focus on a finite-horizon setting where the number of rounds T is known in advance. This condition can be relaxed under certain conditions, using standard techniques (see, e.g., [38]). Note that in (9.20), (i) the solutions come from the ball B ⊇ W instead of W and (ii) the constraint functions are fixed and are given in advance. 
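As a small illustration of the performance measures in (9.20), the snippet below computes, for a given sequence of plays, the regret with respect to a fixed comparator and the cumulative violation of each constraint. The function and argument names are ours; in practice the comparator w_star would be the offline minimizer of the cumulative loss over W.

    def evaluate_run(losses, plays, g_list, w_star):
        # losses: list of per-round loss functions f_t; plays: the algorithm's decisions w_t;
        # g_list: the constraint functions g_i; w_star: a fixed comparator in W.
        regret = sum(f(w) for f, w in zip(losses, plays)) - sum(f(w_star) for f in losses)
        violations = [sum(g(w) for w in plays) for g in g_list]   # long term violation of each g_i
        return regret, violations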
9.5.1 An Impossibility Theorem Before we state our formulation and algorithms, let us review a few alternative techniques that do not need explicit projection. A straightforward approach is to introduce an appropriate self-concordant barrier function for the given convex set W and add it to the objective function such that the barrier diverges at the boundary of the set. Then we can interpret the resulting optimization problem, on the modified objective functions, as an unconstrained minimization problem that can be solved without projection steps. Following the analysis in [3], with an appropriately designed procedure for updating solutions, we could guarantee √ a regret bound of O( T ) without the violation of constraints. A similar idea is used in [2] for online bandit learning and in [114] for a random walk approach for regret minimization which, in fact, translates the issue of projection into the difficulty of sampling. Even for linear Lipschitz cost functions, the random walk approach requires sampling from a Gaussian distribution with covariance given by the Hessian of the self-concordant barrier of the convex set W that has the same time complexity as inverting a matrix. The main limitation with these approaches is that they require computing the Hessian matrix of the objective 261 function in order to guarantee that the updated solution stays within the given domain W. This limitation makes it computationally unattractive when dealing with high dimensional data. In addition, except for well known cases, it is often unclear how to efficiently construct a self-concordant barrier function for a general convex domain. An alternative approach for online convex optimization with long term constraints is to introduce a penalty term in the loss function that penalizes the violation of constraints. More specifically, we can define a new loss function fˆt (w) as fˆt (w) = ft (w) + δ m ∑ [gi (w)]+ , (9.21) i=1 where [z]+ = max(0, 1−z) and δ > 0 is a fixed positive constant used to penalize the violation of constraints. We then run the standard OGD algorithm from Chapter 2 to minimize the modified loss function fˆt (w). The following theorem shows that this simple strategy fails to achieve sub-linear bound for both regret and the long term violation of constraints at the same time. Theorem 9.15. Given δ > 0, there always exists a sequence of loss functions f1 , f2 , · · · , fT ∑T ∑T and a constraint function g(w) such that either f (w ) − min t t g(w)≤0 t=1 t=1 ft (w) = O(T ) or ∑T t=1 [g(wt )]+ = O(T ) holds, where w1 , w2 , · · · , wT is the sequence of solutions generated by the OGD algorithm that minimizes the modified loss functions given in (9.21). Proof. We first show that when δ < 1, there exists a loss function and a constraint function such that the violation of constraint is linear in T . To see this, we set ft (w) = w⊤ w, t ∈ [T ] and g(w) = 1 − w⊤ w. Assume we start with an infeasible solution, that is, g(w1 ) > 0 or w1⊤ w < 1. Given the solution wt obtained at tth trial, using the standard gradient descent approach, we have wt+1 = wt − η(1 − δ)w. Hence, if wt⊤ w < 1, since we have 262 ⊤ w < w⊤ w < 1, if we start with an infeasible solution, all the solutions obtained over wt+1 t the trails will violate the constraint g(w) ≤ 0, leading to a linear number of violation of constraints. Based on this analysis, we assume δ > 1 in the analysis below. 
Given a strongly convex loss function f (w) with modulus γ, we consider a constrained optimization problem given by min f (w), g(w)≤0 which is equivalent to the following unconstrained optimization problem min f (w) + λ[g(w)]+ , w where λ ≥ 0 is the Lagrangian multiplier. Since we can always scale f (w) to make λ ≤ 1/2, it is safe to assume λ ≤ 1/2 < δ. Let w∗ and wa be the optimal solutions to the constrained optimization problems arg ming(w)≤0 f (w) and arg min f (w) + δ[g(w)]+ , respectively. We w choose f (w) such that ∥∇f (w∗ )∥ > 0, which leads to wa ̸= w∗ . This holds because according to the first order optimality condition, we have ∇f (w∗ ) = −λ∇g(w∗ ), ∇f (wa ) = −δ∇g(w∗ ), and therefore ∇f (w∗ ) ̸= ∇f (wa ) when λ < δ. Define ∆ = f (wa ) − f (w∗ ). Since ∆ ≥ γ∥wa − w∗ ∥2 /2 due to the strong convexity of f (w), we have ∆ > 0. Let {wt }Tt=1 be the sequence of solutions generated by the OGD algorithm that minimizes 263 the modified loss function f (w) + δ[g(w)]+ . We have T ∑ f (wt ) + δ[g(wt )]+ ≥ T min f (w) + δ[g(w)]+ w t=1 = T (f (wa ) + δ[g(wa )]+ ) ≥ T (f (wa ) + λ[g(wa )]+ ) = T (f (w∗ ) + λ[g(w∗ )]+ ) + T (f (wa ) + λ[g(wa )]+ − f (w∗ ) − λ[g(w∗ )]) ≥ T min f (w) + T ∆. g(w)≤0 As a result, we have T ∑ f (wt ) + δ[g(wt )]+ − min f (w) = O(T ), g(w)≤0 t=1 implying that either the regret ∑T t=1 f (wt ) − T f (w∗ ) or the violation of the constraints ∑T t=1 [g(w)]+ is linear in T . To better understand the performance of penalty based approach, here we analyze the performance of the OGD in solving the online optimization problem in (9.20). The algorithm is analyzed using the following lemma from [158] (see also Theorem 2.8 in Chapter 2). Lemma 9.16. Let w1 , w2 , . . . , wT ∈ W be the sequence of solutions obtained by applying OGD on the sequence of bounded convex functions f1 , f2 , . . . , fT . Then, for any solution w∗ ∈ W we have T ∑ t=1 ft (wt ) − T ∑ t=1 R2 η ∑ + ft (w∗ ) ≤ ∥∇ft (wt )∥2 . 2η 2 T t=1 We apply OGD to functions fˆt (w), t ∈ [T ] defined in (9.21), that is, instead of updating the solution based on the gradient of ft (w), we update the solution by the gradient of fˆt (w). 264 Using Lemma 9.16, by expanding the functions fˆt (w) based on (9.21) and considering the ∑ 2 fact that m i=1 [gi (w∗ )]+ = 0, we get T ∑ ft (wt ) − t=1 T ∑ t=1 δ ∑∑ R2 η ∑ [gi (w)]2+ ≤ + ∥∇fˆt (wt )∥2 . ft (w∗ ) + 2 2η 2 T m T t=1 i=1 (9.22) t=1 From the definition of fˆt (w), the norm of the gradient ∇fˆt (wt ) is bounded as follows ∥∇fˆt (w)∥2 = ∥∇ft (w) + δ m ∑ [gi (w)]+ ∇gi (w)∥2 ≤ 2G2 (1 + mδ 2 D2 ), (9.23) i=1 where the inequality holds because (a1 + a2 )2 ≤ 2(a21 + a22 ). By substituting (9.23) into the (9.22) we have: T ∑ ft (wt ) − t=1 T ∑ δ ∑∑ R2 [gi (wt )]2+ ≤ + ηG2 (1 + mδ 2 D2 )T. 2 2η T ft (w∗ ) + t=1 m (9.24) t=1 i=1 Since [·]2+ is a convex function, from Jensen’s inequality and following the fact that ∑T t=1 ft (wt )− ft (w∗ ) ≥ −F T , we have:  2 T m ∑ m T ∑ δ ∑∑ δ R2 2   gi (wt ) ≤ [gi (wt )]+ ≤ + ηG2 (1 + mδ 2 D2 )T + F T. 2T 2 2η i=1 t=1 i=1 t=1 + By minimizing the right hand side of (9.24) with respect to η, we get the regret bound as T ∑ t=1 ft (wt ) − T ∑ ft (w∗ ) ≤ RG √ √ 2(1 + mδ 2 D2 )T = O(δ T ) t=1 265 (9.25) and the bound for the violation of constraints as T ∑ t=1 √( ) 2T R2 2 2 2 gi (wt ) ≤ + ηG (1 + mδ D )T + F T = O(T 1/4 δ 1/2 + T δ −1/2 ). 
(9.26) 2η δ By examining the bounds obtained in (9.25) and (9.26), it turns out that in order to recover √ O( T ) regret bound, we need to set δ to be a constant, leading to O(T ) bound for the violation of constraints in the long run, which is not satisfactory at all. The analysis shows √ that in order to obtain O( T ) regret bound, linear bound on the long term violation of the constraints is unavoidable. The main reason for the failure of using modified loss function in (9.21) is that the weight constant δ is fixed and independent from the sequence of solutions obtained so far. In the next subsection, we present an online convex-concave formulation for online convex optimization with long term constraints, which explicitly addresses the limitation of (9.21) by automatically adjusting the weight constant based on the violation of the solutions obtained so far. As mentioned before, our general strategy is to turn online convex optimization with long term constraints into a convex-concave optimization problem. Instead of generating a sequence of solutions that satisfies the long term constraints, we first consider an online optimization strategy that allows the violation of constraints on some rounds in a controlled way. We then modify the online optimization strategy to obtain a sequence of solutions that obeys the long term constraints. Although the online convex optimization with long term constraints is clearly easier than the standard online convex optimization problem, it is straightforward to see that optimal regret bound for online optimization with long term √ constraints should be on the order of O( T ), no better than the standard online convex optimization problem. 266 9.5.2 √ An Efficient Algorithm with O( T ) Regret Bound and O(T 3/4 ) Bound on the Violation of Constraints The intuition behind our approach stems from the observation that the constrained optimization problem minw∈W ∑T t=1 ft (w) is equivalent to the following convex-concave optimization problem min max w∈B λ∈Rm + T ∑ ft (w) + t=1 m ∑ λi gi (w), (9.27) i=1 where λ = (λ1 , . . . , λm )⊤ is the vector of Lagrangian multipliers associated with the constraints gi (·), i = 1, . . . , m and belongs to the nonnegative orthant Rm + . To solve the online convex-concave optimization problem, we extend the gradient based approach for variational inequality [116] to (9.27). To this end, we consider the following regularized convex-concave function as } m { ∑ δη 2 Lt (w, λ) = ft (w) + λi gi (w) − λi , 2 (9.28) i=1 where δ > 0 is a constant whose value will be decided by the analysis. Note that in (9.28), we introduce a regularizer δηλ2i /2 to prevent λi from being too large. This is because, when λi ∑ is large, we may encounter a large gradient for w because of ∇w Lt (w, λ) ∝ m i=1 λi ∇gi (w), leading to unstable solutions and a poor regret bound. Although we can achieve the same goal by restricting λi to a bounded domain, using the quadratic regularizer makes it convenient for our analysis. 267 Algorithm 14 shows the detailed steps of the proposed algorithm. Unlike standard online convex optimization algorithms that only update w, Algorithm 14 updates both w and λ. In addition, unlike the modified loss function in (9.21) where the weights for constraints m {gi (w) ≤ 0}m i=1 are fixed, Algorithm 14 automatically adjusts the weights {λi }i=1 based on {gi (w)}m i=1 , the violation of constraints, as the game proceeds. 
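A minimal sketch of the primal-dual update induced by (9.28) is given below; Algorithm 14, presented next, states the precise steps. The function names are ours, the ball B is taken to be the unit ball purely for illustration, and the per-round quantities (the loss gradient and the constraint values and gradients at w_t) are assumed to be supplied by the caller.

    import numpy as np

    def soft_constraint_step(w, lam, grad_f_t, g_vals, g_grads, eta, delta):
        # One round of gradient descent/ascent on L_t in (9.28):
        # descent on w over the ball B (radius 1 here), ascent on the multipliers lambda >= 0.
        grad_w = grad_f_t + sum(l * gg for l, gg in zip(lam, g_grads))   # grad_w L_t(w, lam)
        grad_lam = np.asarray(g_vals) - eta * delta * lam                # grad_lam L_t(w, lam)
        w_new = w - eta * grad_w
        norm = np.linalg.norm(w_new)
        if norm > 1.0:
            w_new = w_new / norm                                         # projection onto the ball B
        lam_new = np.maximum(lam + eta * grad_lam, 0.0)                  # projection onto the nonnegative orthant
        return w_new, lam_new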
It is this property that allows Algorithm 14 to achieve sub-linear bound for both regret and the violation of constraints. To analyze Algorithm 14, we first state the following lemma, the key to the main theorem on the regret bound and the violation of constraints. Lemma 9.17. Let Lt (·, ·) be the function defined in (9.28) which is convex in its first argument and concave in its second argument. Then for any (w, λ) ∈ B × Rm + we have Lt (wt , λ) − Lt (w, λt ) ≤ 1 (∥w − wt ∥2 + ∥λ − λt ∥2 − ∥w − wt+1 ∥2 − ∥λ − λt+1 ∥2 ) 2η η + (∥∇w Lt (wt , λt )∥2 + ∥∇λ Lt (wt , λt )∥2 ). 2 Proof. Following the analysis of [158], convexity of Lt (·, λ) implies that Lt (wt , λt ) − Lt (w, λt ) ≤ ⟨wt − w, ∇w Lt (wt , λt )⟩ (9.29) and by concavity of Lt (w, ·) we have Lt (wt , λ) − Lt (wt , λt ) ≤ ⟨λ − λt , ∇λ Lt (wt , λt )⟩. (9.30) Combining the inequalities (9.29) and (9.30) results in Lt (wt , λ) − Lt (w, λt ) ≤ ⟨wt − w, ∇w Lt (wt , λt )⟩ − ⟨λ − λt , ∇λ Lt (wt , λt )⟩. 268 (9.31) Algorithm 14 Online Gradient Descent with Soft Constraints 1: Input: • constraints gi (w) ≤ 0, i ∈ [m] • step size η • constant δ > 0 2: Initialize: w1 = 0 and λ1 = 0 3: for t = 1, 2, . . . , T do 4: Submit solution wt 5: Receive the convex function ft (·) and suffer loss ft (wt ) 6: Compute the gradients as: ∇w Lt (wt , λt ) = ∇ft (wt ) + m ∑ λit ∇gi (wt ) i=1 ∇λi Lt (wt , λt ) = gi (wt ) − ηδλit 7: Update wt and λt by: wt+1 = ΠB (wt − η∇w Lt (wt , λt )) λt+1 = Π[0,+∞)m (λt + η∇λ Lt (wt , λt )) 8: end for Using the update rule for wt+1 in terms of wt and expanding, we get ∥w − wt+1 ∥2 ≤ ∥w − wt ∥2 − 2η⟨wt − w, ∇w Lt (wt , λt )⟩ + η 2 ∥∇w Lt (wt , λt )∥2 , (9.32) where the first inequality follows from the nonexpansive property of the projection operation (see Lemma A.5). Expanding the inequality for ∥λ − λt+1 ∥2 in terms of λt and plugging back into the (9.31) with (9.32) establishes the desired inequality. Proposition 9.18. Let wt and λt , t ∈ [T ] be the sequence of solutions obtained by Algorithm 14. Then for any w ∈ B and λ ∈ Rm + , we have 269 T ∑ Lt (wt , λ) − Lt (w, λt ) t=1 T ) η( )∑ R2 + ∥λ∥2 ηT ( 2 2 2 2 2 ≤ + (m + 1)G + 2mD + (m + 1)G + 2mδ η ∥λt ∥2 . 2η 2 2 t=1 Proof. We first bound the gradient terms in the right hand side of Lemma 9.17. Using the inequality (a1 + a2 + . . . , an )2 ≤ n(a21 + a22 + . . . + a2n ), we have ∥∇w Lt (wt , λt )∥2 ≤ ( ) (m + 1)G2 1 + ∥λt ∥2 and ∥∇λ Lt (wt , λt )∥2 ≤ 2m(D2 + δ 2 η 2 ∥λt ∥2 ). In Lemma 9.17, by adding the inequalities of all iterations, and using the fact ∥w∥ ≤ R we complete the proof. The following theorem bounds the regret and the violation of the constraints in the long run for Algorithm 14. √ √ Theorem 9.19. Define a = R (m + 1)G2 + 2mD2 . Set η = R2 /[a T ]. Assume T is √ large enough such that 2 2η(m + 1) ≤ 1. Choose δ such that δ ≥ (m + 1)G2 + 2mδ 2 η 2 . Let w1 , w2 , · · · , wT be the sequence of solutions obtained by Algorithm 14. Then for the optimal ∑ solution w∗ = minw∈W Tt=1 ft (w) we have T ∑ t=1 T ∑ √ ft (wt ) − ft (w∗ ) ≤ a T = O(T 1/2 ), and √ gi (wt ) ≤ ( √ )√ 2 FT + a T T t=1 ( δR2 ma + 2 a R ) = O(T 3/4 ). Proof. We begin by expanding (9.33) using (9.28) and rearranging the terms to get 270 T ∑  [ft (wt ) − ft (w)] + t=1 m  ∑ T ∑ i=1 t=1 λ  i gi (wt ) − T ∑ t=1   δηT i ∥λ∥2 λt gi (w) −  2 T ) δη ∑ R2 + ∥λ∥2 ηT ( ≤− ∥λt ∥2 + + (m + 1)G2 + 2mD2 2 2η 2 t=1 T )∑ η( 2 2 2 + (m + 1)G + 2mδ η ∥λt ∥2 . 
2 t=1 Since δ ≥ (m + 1)G2 + 2mδ 2 η 2 , we can drop the ∥λt ∥2 terms from both sides of the above inequality and obtain T ∑ [ft (wt ) − ft (w)] + t=1  m  ∑ i=1 ≤  m ∑ T ∑ λi T ∑ ( gi (wt ) − t=1 λit gi (w) + i=1 t=1 δηT m + 2 2η ) λ2i    ) R2 ηT ( + (m + 1)G2 + 2mD2 ) . 2η 2 The left hand side of above inequality consists of two terms. The first term basically measures the difference between the cumulative loss of the Algorithm 14 and the optimal solution and the second term includes the constraint functions with corresponding Lagrangian multipliers which will be used to bound the long term violation of the constraints. By taking maximization for λ over the range (0, +∞), we get   T T  ∑ ∑ + i − λt gi (w) [ft (wt ) − ft (w)] +  2(δηT + m/η)   t=1 t=1 i=1  ) R2 ηT ( 2 2 + (m + 1)G + 2mD ) . ≤ 2η 2  [∑ ]2 T m  g (w )  t ∑ i t=1 271 Since w∗ ∈ W, we have gi (w∗ ) ≤ 0, i ∈ [m], and the resulting inequality becomes T ∑ ft (wt ) − ft (w∗ ) + t=1 m ∑ i=1 [∑ ]2 T g (w ) t=1 i t + ) R2 ηT ( 2 2 ≤ + (m + 1)G + 2mD ) . 2(δηT + m/η) 2η 2 The statement of the first part of the theorem follows by using the expression for η. The second part is proved by substituting the regret bound by its lower bound as ∑T t=1 ft (wt ) − ft (w∗ ) ≥ −F T . Remark 9.20. We observe that the introduction of quadratic regularizer δη∥λ∥2 /2 allows [∑ ]2 ∑ T us to turn the expression λi Tt=1 gi (wt ) into g (w ) , leading to the bound for the t=1 i t + violation of the constraints. In addition, the quadratic regularizer defined in terms of λ allows us to work with unbounded λ because it cancels the contribution of the ∥λt ∥ terms from the loss function and the bound on the gradients ∥∇w Lt (w, λ)∥. Note that the constraint for δ mentioned in Theorem 9.19 is equivalent to 2 √ ≤δ≤ 1/(m + 1) + (m + 1)−2 − 8G2 η 2 1/(m + 1) + √ (m + 1)−2 − 8G2 η 2 , 4η 2 (9.33) from which, when T is large enough (i.e., η is small enough), we can simply set δ = 2(m + 1)G2 that will obey the constraint in (9.33). By investigating Lemma 9.17, it turns out that the boundedness of the gradients is essential to obtain bounds for Algorithm 14 in Theorem 9.19. Although, at each iteration, λt is projected onto the Rm + , since W is a compact set and functions ft (w) and gi (w), i ∈ [m] are convex, the boundedness of the functions implies that the gradients are bounded [22, Proposition 4.2.3]. 272 9.5.3 An Efficient Algorithm with O(T 3/4 ) Regret Bound and without Violation of Constraints In this subsection we generalize Algorithm 14 such that the constrained are satisfied in a long run. To create a sequence of solutions {wt , t ∈ [T ]} that satisfies the long term constraints ∑T t=1 gi (wt ) ≤ 0, i ∈ [m], we make two modifications to Algorithm 14. First, instead of handling all of the m constraints, we consider a single constraint defined as g(w) = maxi∈[m] gi (w). Apparently, by achieving zero violation for the constraint g(w) ≤ 0, it is guaranteed that all of the constraints gi (·), i ∈ [m] are also satisfied in the long term. Furthermore, we change Algorithm 14 by modifying the definition of Lt (·, ·) as Lt (w, λ) = ft (w) + λ(g(w) + γ) − ηδ 2 λ , 2 (9.34) where γ > 0 will be decided later. This modification is equivalent to considering the constraint g(w) ≤ −γ, a tighter constraint than g(w) ≤ 0. 
The main idea behind this modification is that by using a tighter constraint in our algorithm, the resulting sequence of solutions will satisfy the long term constraint ∑T t=1 g(wt ) ≤ 0, even though the tighter constraint is violated in many trials. Before proceeding, we state a fact about the Lipschitz continuity of the function g(w) in the following proposition. Proposition 9.21. Assume that functions gi (·), i ∈ [m] are Lipschitz continuous with constant G. Then, function g(w) = maxi∈[m] gi (w) is Lipschitz continuous with constant G, that is, |g(w) − g(w′ )| ≤ G∥w − w′ ∥ for any w ∈ B and w′ ∈ B. 273 Proof. See Appendix ??. To obtain a zero bound on the violation of constraints in the long run, we make the following assumption about the constraint function g(w). Assumption 9.22. Let W ′ ⊆ W be the convex set defined as W ′ = {w ∈ Rd : g(w)+γ ≤ 0} where γ ≥ 0. We assume that the norm of the gradient of the constraint function g(w) is lower bounded at the boundary of W ′ , that is, A5 min g(w)+γ=0 ∥∇g(w)∥ ≥ σ. A direct consequence of assumption A5 is that by reducing the domain W to W ′ , the optimal value of the constrained optimization problem minw∈W f (w) does not change much, as revealed by the following theorem. Theorem 9.23. Let w∗ and wγ be the optimal solutions to the constrained optimization problems defined as ming(w)≤0 f (w) and ming(w)≤−γ f (w), respectively, where f (w) = ∑T t=1 ft (w) and γ ≥ 0. We have |f (w∗ ) − f (wγ )| ≤ G γT. σ Proof. We note that the optimization problem ming(w)≤−γ f (w) = ming(w)≤−γ ∑T t=1 ft (w), can also be written in the minimax form as f (wγ ) = min max w∈B λ∈R+ T ∑ ft (w) + λ(g(w) + γ), t=1 274 (9.35) where we use the fact that W ′ ⊆ W ⊆ B. We denote by wγ and λγ the optimal solutions to (9.35). We have f (wγ ) = min max T ∑ w∈B λ∈R+ = min w∈B ≤ T ∑ T ∑ ft (w) + λ(g(w) + γ) t=1 ft (w) + λγ (g(w) + γ) t=1 ft (w∗ ) + λγ (g(w∗ ) + γ) ≤ t=1 T ∑ ft (w∗ ) + λγ γ, t=1 where the second equality follows the definition of the wγ and the last inequality is due to the optimality of w∗ , that is, g(w∗ ) ≤ 0. To bound |f (wγ ) − f (w∗ )|, we need to bound λγ . Since wγ is the minimizer of (9.35), from the optimality condition we have − T ∑ ∇ft (wγ ) = λγ ∇g(wγ ). (9.36) t=1 By setting v = − ∑T t=1 ∇ft (wγ ), we can simplify (9.36) as λγ ∇g(wγ ) = v. From the KKT optimality condition [34], if g(wγ ) + γ < 0 then we have λγ = 0; otherwise according to Assumption 9.22 we can bound λγ by λγ ≤ ∥v∥ GT ≤ . ∥∇g(wγ )∥ σ We complete the proof by applying the fact f (w∗ ) ≤ f (wγ ) ≤ f (w∗ ) + λγ γ. As indicated by Theorem 9.23, when γ is small, we expect the difference between two 275 optimal values f (w∗ ) and f (wγ ) to be small. Using the result from Theorem 9.23, in the following theorem, we show that by running Algorithm 14 on the modified convex-concave functions defined in (9.34), we are able to obtain an O(T 3/4 ) regret bound and zero bound on the violation of constraints in the long run. Theorem 9.24. Set a = 2R/ √ √ 2G2 + 3(D2 + b2 ), η = R2 /[a T ], and δ = 4G2 . Let wt , t ∈ [T ] be the sequence of solutions obtained by Algorithm 14 with functions defined in √ (9.34) with γ = bT −1/4 and b = 2 F (δR2 a−1 + aR−2 ). Let w∗ be the optimal solution to √ ∑ minw∈W Tt=1 ft (w). With sufficiently large T , that is, F T ≥ a T , and under Assump∑T tion 9.22, we have wt , t ∈ [T ] satisfy the global constraint t=1 g(wt ) ≤ 0 and the regret is bounded by RegretT = T ∑ t=1 √ b ft (wt ) − ft (w∗ ) ≤ a T + GT 3/4 = O(T 3/4 ). σ Proof. 
Let wγ be the optimal solution to ming(w)≤−γ ∑T t=1 ft (w). Similar to the proof of Theorem 9.19 when applied to functions in (9.34) we have T ∑ ft (wt ) − t=1 ≤ − T ∑ t=1 ft (w) + λ T ∑  (g(wt ) + γ) −  t=1 T ∑  λt  (g(w) + γ) − t=1 δηT 2 λ 2 T T ) η( )∑ δη ∑ 2 R2 + λ2 ηT ( 2 λt + + 2G + 3(D2 + γ 2 ) + 2G2 + 3δ 2 η 2 λ2t . 2 2η 2 2 t=1 t=1 By setting δ ≥ 2G2 + 3δ 2 η 2 which is satisfied by δ = 4G2 , we cancel the terms including λt from the right hand side of above inequality. By maximizing for λ over the range (0, +∞) and noting that γ ≤ b, for the optimal solution wγ , we have 276 T ∑ [ ] ft (wt ) − ft (wγ ) + [∑ ]2 T t=1 g(wt ) + γT + 2(δηT + 1/η) t=1 ≤ ) R2 ηT ( 2 + 2G + 3(D2 + b2 ) , 2η 2 which, by optimizing for η and applying the lower bound for the regret as ∑T t=1 ft (wt ) − ft (wγ ) ≥ −F T , yields the following inequalities T ∑ √ ft (wt ) − ft (wγ ) ≤ a T (9.37) t=1 and T ∑ √ g(wt ) ≤ ( √ )√ 2 FT + a T T t=1 ( δR2 a + 2 a R ) − γT, (9.38) for the regret and the violation of the constraint, respectively. Combining (9.37) with the √ ∑T ∑T result of Theorem 9.23 results in f (w ) ≤ f (w ) + a T + (G/σ)γT . By γ ∗ t t t=1 t=1 choosing γ = bT −1/4 we attain the desired regret bound as T ∑ t=1 √ bG 3/4 ft (wt ) − ft (w∗ ) ≤ a T + T = O(T 3/4 ). σ To obtain the bound on the violation of constraints, we note that in (9.38), when T is √ √ ∑ sufficiently large, that is, F T ≥ a T , we have Tt=1 g(wt ) ≤ 2 F (δR2 a−1 + aR−2 )T 3/4 − √ bT 3/4 . Choosing b = 2 F (δR2 a−1 + aR−2 )T 3/4 guarantees the zero bound on the violation of constraints as claimed. 277 9.6 Proofs of Convergence Rates In this section we provide the proof of the main lemmas omitted from the analysis of convergence rate. 9.6.1 Proof of Lemma 9.10 Following the standard analysis of gradient descent methods, we have for any w ∈ B, ′ ∥wt+1 − w∥22 − ∥wt − w∥22 ≤ ∥wt+1 − w∥22 − ∥wt − w∥22 = ∥wt − ηt (g(wt , ξt ) + λt ∇g(wt )) − w∥22 − ∥wt − w∥22 ≤ ηt2 ∥g(wt , ξt ) + λt ∇g(wt )∥22 − 2ηt (wt − w)⊤ (g(wt , ξt ) + λt ∇g(wt )) ≤ ηt2 ∥g(wt , ξt ) + λt ∇g(wt )∥22 − 2ηt (wt − w)⊤ (∇f (wt ) + λt ∇g(wt )) +2ηt (w − wt )⊤ (g(wt , ξt ) − ∇f (wt )), ≡∇w L(wt ,λt ) ≡ζt (w) Then we have (wt − w)⊤ ∇w L(wt , λt ) ) η 1 ( t ≤ ∥wt − w∥22 − ∥wt+1 − w∥22 + ∥g(wt , ξt ) + λt ∇g(wt )∥22 + ζt (w) 2ηt 2 ( ) 1 ≤ ∥wt − w∥22 − ∥wt+1 − w∥22 + ηt ∥g(wt , ξt )∥22 + ηt λ2t ∥∇g(wt )∥22 + ζt (w) 2ηt ) 1 ( 2 2 ≤ ∥wt − w∥2 − ∥wt+1 − w∥2 2ηt + 2ηt ∥g(wt , ξt ) − ∇f (wt )∥22 +2ηt ∥∇f (wt )∥22 + ηt λ2t ∥∇g(wt )∥22 + ζt (w) ≡∆t 278 By using the bound on ∥∇f (wt )∥2 and ∥∇g(wt )∥2 , we obtain the first inequality in Lemma 9.10. To prove the second inequality, we follow the same analysis, i.e., |λt+1 − λ|2 − |λt − λ|2 ≤ |λt + ηt (g(wt ) − γλt )|2 − |λt − λ|2 ≤ ηt2 |g(wt ) − γλt |2 + 2ηt (λt − λ) (g(wt ) − γλt ) . ≡∇λ L(wt ,λt ) Then we have (λ − λt )∇λ L(wt , λt ) ≤ ) η 1 ( t |λt − λ|2 − |λt+1 − λ|2 + |g(wt ) − γλt |2 . 2ηt 2 By induction, it is straightforward to show that λt ≤ C2 /γ, which yields the second inequality in Lemma 9.10, i.e., (λ − λt )∇λ L(wt , λt ) ≤ 9.6.2 ) 1 ( |λt − λ|2 − |λt+1 − λ|2 + 2ηt C22 . 2ηt Proof of Lemma 9.11 Since Lt (w, λ) is convex in w and concave in λ, we have the following inequalities L(w, λt ) − L(wt , λt ) ≥ (w − wt )⊤ ∇w L(wt , λt ), L(wt , λ) − L(wt , λt ) ≤ (λ − λt )∇λ L(wt , λt ). 
279 Using the inequalities in Lemma 9.10, we have 1 2ηt 1 L(wt , λ) − L(wt , λt ) ≤ 2ηt L(wt , λt ) − L(w, λt ) ≤ ( ( ) ∥w − wt ∥22 − ∥w − wt+1 ∥22 + 2ηt G21 + ηt G22 λ2t + 2ηt ∆t + ζt (w), |λ − λt |2 − |λ − λt+1 |2 ) + 2ηt C22 , where ζt (w) = ⟨w − wt , g(wt , ξt ) − ∇f (wt )⟩ as abbreviated before. Since η1 = · · · = ηT , denoted by η, by taking summation of above two inequalities over t = 1, · · · , T , we get T ∑ t=1 ∑ ∑ ∑ ∥w∥22 λ2 L(wt , λ) − L(w, λt ) ≤ + + 2ηT (G21 + C22 ) + ηG22 λ2t + 2η ∆t + ζt (w). 2η 2η t T T t=1 t=1 By plugging the expression of L(w, λ), and due to ∥w∥2 ≤ 1, we have T ∑ t=1 ≤ (f (wt ) − f (w)) + λ T ∑ ( g(wt ) − t=1 γT 1 + 2 2η ) λ2 ∑ ∑ ∑ ∑ 1 + 2ηT (G21 + C22 ) + (ηG22 − γ/2)λ2t + λt g(w) + 2η ∆t + ζt (w). 2η t t T T t=1 t=1 Let w = w∗ = arg minw∈W f (w). By taking minimization over λ ≥ 0 on left hand side and considering η = γ/(2G22 ), we have ∑ T T T ∑ ∑ [ Tt=1 g(wt )]2+ G22 (G21 + C22 ) γ ∑ (f (wt ) − f (w∗ )) + ≤ + γT + 2 ∆t + ζt (w∗ ) 2 /γ) 2 γ 2(γT + 2G G G 2 2 2 t=1 t=1 t=1 280 9.6.3 Proof of Lemma 9.13 Since F (w) is strongly convex in w, we have F (w) − F (wt ) ≥ ⟨w − wt , ∇F (wt )⟩ + β ∥w − wt ∥22 . 2 Following the same analysis as in Lemma 9.10, we have 1 (wt − w)⊤ ∇F (wt ) ≤ 2ηt ( ∥w − wt ∥22 − ∥w − wt+1 ∥22 ) + ηt ∥g(wt , ξt ) + p(wt )λ0 ∇g(wt )∥22 2 β + ζt (w) − ∥w − wt ∥22 2 ) ( 1 ∥w − wt ∥22 − ∥w − wt+1 ∥22 ≤ 2ηt β + ηt G21 + ηt λ20 G22 + ζt (w) − ∥w − wt ∥22 , 2 where p(w) = exp (λ0 g(w)/γ) . 1 + exp (λ0 g(w)/γ) Taking summation of above inequality over t = 1, · · · , T gives T ∑ F (wt ) − F (w) ≤ t=1 + ( T ∑ 1 1 t=1 T ∑ 2 1 β − − ηt ηt−1 2 ηt (G21 + λ20 G22 ) + t=1 ) T ∑ t=1 Since ηt = 1/(2βt), we have 281 ∥w − wt ∥22 ζt (w) − T β∑ ∥w − wt ∥22 . 4 t=1 T T T ∑ (G21 + λ20 G22 )(1 + ln T ) ∑ β∑ (F (wt ) − F (w)) ≤ + ζt (w) − ∥w − wt ∥22 2β 4 t=1 t=1 t=1 We complete the proof by letting w = w∗ = arg minw∈W f (w). 9.6.4 Proof of Lemma 9.14 The proof is based on the Berstein’s inequality for martingales (see Theorem A.25). To do so, define the martingale difference Xt = ⟨w − wt , ∇f (wt ) − g(wt , ξt )⟩ and martingale ΛT = ∑T 2 t=1 Xt . Define the conditional variance ΣT as Σ2T = T ∑ t=1 T [ ] ∑ ∥wt − w∥22 = 4G21 DT . Eξt Xt2 ≤ 4G21 t=1 Define K = 4G1 . We have ( ) √ √ 2 Pr ΛT ≥ 2 4G1 DT τ + 2Kτ /3 ( ) √ √ 2 2 2 = Pr ΛT ≥ 2 4G1 DT τ + 2Kτ /3, ΣT ≤ 4G1 DT ( ) √ √ 4 2 2 2 = Pr ΛT ≥ 2 4G1 DT τ + 2Kτ /3, ΣT ≤ 4G1 DT , DT ≤ T ) ( m √ ∑ √ 4 4 i−1 i 2 2 < DT ≤ 2 + Pr ΛT ≥ 2 4G21 DT τ + 2Kτ /3, ΣT ≤ 4G1 DT , 2 T T i=1 ) ( √ ( ) ∑ m √ 4 4 4 ≤ Pr DT ≤ + Pr ΛT ≥ 2 × 4G21 2i τ + 2Kτ /3, Σ2T ≤ 4G21 2i T T T i=1 ( ) 4 ≤ Pr DT ≤ + me−τ . T 282 where we use the fact ∥wt − w∥22 ≤ 4 for any w ∈ B, and the last step follows the Bernstein inequality for martingales. We complete the proof by setting τ = ln(m/δ). 9.7 Summary In this chapter, we made a progress towards making the SGD method efficient by proposing a framework in which it is possible to exclude the projection steps from the SGD algorithm. We have proposed two novel algorithms to overcome the computational bottleneck of the projection step in applying SGD to optimization problems with complex domains. We showed using novel theoretical analysis that the proposed algorithms can achieve an √ O(1/ T ) convergence rate for general convex functions and an O(ln T /T ) rate for strongly convex functions with a overwhelming probability which are known to be optimal (up to a logarithmic factor) for stochastic optimization. 
We have also addressed the problem of online convex optimization with constraints, where the constraints only need to be satisfied in the long run. In addition to the regret bound, which is the main tool for analyzing the performance of general online convex optimization algorithms, we defined a bound on the long-term violation of the constraints, which measures the cumulative violation of the constraints by the solutions over all rounds. Our setting applies to online convex optimization without projecting the solutions onto the convex domain at each iteration, a step that may be computationally expensive for complex domains. Our strategy is to turn the problem into an online convex-concave optimization problem and to apply an online gradient descent algorithm to solve it.

9.8 Bibliographic Notes

Generally, the computational complexity of the projection step in SGD has seldom been taken into account in the literature. Here, we briefly review previous work on projection-free convex optimization, which is closely related to the theme of this study. For some specific domains, efficient algorithms have been developed to circumvent the high computational cost caused by the projection step at each iteration of gradient descent methods. The main idea is to select an appropriate direction to take from the current solution such that the next solution is guaranteed to stay within the domain. Clarkson [42] proposed a sparse greedy approximation algorithm for convex optimization over a simplex domain, which is a generalization of an old algorithm by Frank and Wolfe [56] (a.k.a. conditional gradient descent [21]). Zhang [155] introduced a similar sequential greedy approximation algorithm for certain convex optimization problems over a domain given by a convex hull. Hazan [67] devised an algorithm for approximately maximizing a concave function over a trace-norm-bounded PSD cone, which only needs to compute the maximum eigenvalue and the corresponding eigenvector of a symmetric matrix. Ying et al. [151] formulated distance metric learning problems as eigenvalue maximization and proposed an algorithm similar to [67]. Recently, Jaggi [77] put these ideas into a general framework for convex optimization over a general convex domain. Instead of projecting the intermediate solution onto a complex convex domain, Jaggi's algorithm solves a linearized problem over the same domain. He showed that Clarkson's algorithm, Zhang's algorithm, and Hazan's algorithm discussed above are special cases of his general algorithm for particular domains. It is important to note that all these algorithms are designed for batch optimization, not for stochastic optimization, which is the focus of this chapter. The proposed stochastic optimization methods with only one projection are closely related to the online Frank-Wolfe (OFW) algorithm proposed in [73]. It is a projection-free online learning algorithm, built on the assumption that it is possible to efficiently minimize a linear function over the complex domain. Indeed, the OFW algorithm replaces the projection onto the convex domain with a linear program over the same domain, which makes sense if solving the linear program is cheaper than the projection. One main shortcoming of the OFW algorithm is that its convergence rate for general stochastic optimization is O(T^{-1/3}), significantly slower than that of a standard stochastic gradient descent algorithm (i.e., O(T^{-1/2})).
It achieves a convergence rate of O(T^{-1/2}) only when the objective function is smooth, which unfortunately does not hold for many machine learning problems, where either a non-smooth regularizer or a non-smooth loss function is used. Another limitation of OFW is that it assumes a linear optimization problem over the domain W can be solved efficiently. Although this assumption holds for some specific domains as discussed in [73], in many settings of practical interest it may not be true. The proposed algorithms address these two limitations explicitly. In particular, we show how two seemingly different modifications of SGD can be used to avoid performing expensive projections while retaining convergence rates similar to the original SGD method.

The proposed online optimization with soft constraints setup is reminiscent of regret minimization with side constraints, or constrained regret minimization, addressed in [110] and motivated by applications in wireless communication. In regret minimization with side constraints, beyond minimizing regret, the learner has some side constraints that need to be satisfied on average over all rounds. Unlike our setting, in learning with side constraints the set W is controlled by the adversary and can vary arbitrarily from trial to trial. It has been shown that if the convex set is affected by both decisions and loss functions, the minimax optimal regret is generally unattainable online [111].

APPENDIX

Appendix A

Technical Background

A.1 Convex Analysis

In this appendix, we introduce the basic definitions and results of convex analysis [121, 34, 28] needed for the analysis of the algorithms in this thesis. We begin by formalizing a few topological definitions used throughout the thesis, followed by a few definitions and results from convexity.

Definition A.1 (Euclidean Ball). A Euclidean ball with radius r centered at point w0 is B(w0, r) = {w ∈ R^d : ∥w − w0∥ ≤ r}. An ℓp ball is defined analogously, with the distance measured by the ℓp norm ∥w∥_p = (∑_{i=1}^d |w_i|^p)^{1/p}.

Definition A.2 (Boundary and Interior). For a given set W, we say that w is on the boundary of W, denoted by ∂W, if for every ϵ > 0 the ball centered at w with radius ϵ covers both the inside and the outside of the set, i.e., B(w, ϵ) ∩ W ≠ ∅ and B(w, ϵ) ∩ W̄ ≠ ∅, where W̄ is the complement of the set W. We say that a point w is in the interior of the set W if ∃ϵ > 0 s.t. B(w, ϵ) ⊂ W.

Definition A.3 (Convex Set). A set W in a vector space is convex if for any two vectors w, w′ ∈ W, the line segment connecting the two points is contained in W as well. In other words, for any λ ∈ [0, 1], we have that λw + (1 − λ)w′ ∈ W. Intuitively, a set is convex if its surface has no "dips". The intersection of an arbitrary family of convex sets is obviously convex.

Definition A.4 (Projection onto Convex Sets). Let W ⊆ R^d be a closed convex set and let w ∈ R^d be a point. The projection of w onto W is denoted by ΠW(w) and defined by ΠW(w) = arg min_{w′∈W} ∥w − w′∥. In particular, for every w ∈ R^d there exists a point w∗ which attains this minimum, and it is unique. Here are a few examples of projections. For the positive semidefinite (PSD) cone, i.e., S^d_+ = {X ∈ R^{d×d} : X ⪰ 0}, the projection requires a full eigendecomposition of the input matrix. Precisely, let the symmetric matrix W have the eigendecomposition W = U^⊤ Σ U, where Σ = diag(λ1, λ2, · · · , λd) and U is an orthogonal matrix. Then Π_{S^d_+}(W) = U^⊤ Σ_+ U, where Σ_+ = diag([λ1]_+, [λ2]_+, · · · , [λd]_+).
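Before continuing with the remaining examples, here is a small illustrative sketch of this PSD-cone projection: eigendecompose the symmetric input and clip the negative eigenvalues at zero. The helper name project_psd is hypothetical and only meant to mirror the formula above (NumPy's eigh uses the column-eigenvector convention, so the code forms U diag([λ]_+) U^⊤, which is the same projection).

```python
import numpy as np

def project_psd(W):
    """Euclidean projection of a symmetric matrix W onto the PSD cone S^d_+."""
    W_sym = (W + W.T) / 2.0               # symmetrize to guard against round-off
    eigvals, U = np.linalg.eigh(W_sym)    # W_sym = U diag(eigvals) U^T
    return (U * np.maximum(eigvals, 0.0)) @ U.T

# For example, a matrix with eigenvalues (2, -1) is mapped to its PSD part:
W = np.array([[2.0, 0.0], [0.0, -1.0]])
print(project_psd(W))                     # [[2. 0.], [0. 0.]]
```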
For a hyperplane H = {x ∈ R^d : ⟨a, x⟩ = b}, a ≠ 0, the projection is given by ΠH(w) = w − ((⟨a, w⟩ − b)/∥a∥^2) a. For an affine set A = {x ∈ R^d : Ax = b}, the projection can be obtained by ΠA(w) = w − A^⊤ (AA^⊤)^{-1} (Aw − b). For a polyhedron P = {x ∈ R^d : Ax ≤ b}, there is no analytical solution, and the projection ΠP(w) is an offline optimization problem by itself: the corresponding problem arg min_{w′∈P} ∥w′ − w∥ must be solved approximately.

Lemma A.5 (Non-expansiveness of Projection). Let W ⊆ R^d be a convex set and consider the projection operation defined as above. Then, ∥ΠW(w) − ΠW(w′)∥ ≤ ∥w − w′∥, ∀w, w′ ∈ R^d.

Definition A.6 (Convex Function). A function f : W → R is said to be convex if W is convex and for every w, w′ ∈ W and λ ∈ [0, 1], f(λw + (1 − λ)w′) ≤ λ f(w) + (1 − λ) f(w′). A continuously differentiable function is convex if f(w) ≥ f(w′) + ⟨∇f(w′), w − w′⟩ for all w, w′ ∈ W. If f is non-smooth, then this inequality holds for any subgradient g ∈ ∂f(w′).

Definition A.7 (Subgradient). A subgradient of a convex function f : R^d → R at some point w is any vector g ∈ R^d that achieves the same lower bound as the tangent line to the function f at point w, i.e., f(w′) ≥ f(w) + ⟨g, w′ − w⟩, ∀w′ ∈ R^d. The subgradient g always exists for convex functions on the relative interior of their domain. Furthermore, if f is differentiable at w, then there is a unique subgradient g = ∇f(w). Note that subgradients need not exist for non-convex functions.

Definition A.8 (Subdifferential). The subdifferential of a convex function f : R^d → R at some point w is the set of all subgradients of f at w, i.e., ∂f(w) = {g ∈ R^d : ∀w′, f(w′) ≥ f(w) + ⟨g, w′ − w⟩}. An important property of the subdifferential ∂f(w) is that it is a closed and convex set, even for non-convex functions, which is straightforward to verify from the definition. Moreover, the subdifferential is always nonempty for convex functions. This is a consequence of the supporting hyperplane theorem, which states that at any point on the boundary of a convex set there exists at least one supporting hyperplane. Since the epigraph of a convex function is a convex set, we can apply the supporting hyperplane theorem to the set of points (w, f(w)), which are exactly the boundary points of the epigraph. The subdifferential of a differentiable function contains only a single element, which is the gradient of the function at that point, i.e., ∂f(w) = {∇f(w)}.

We are now prepared to introduce the concept of Lipschitz continuity, designed to measure the change of function values versus the change in the independent variable for a general function f(·).

Definition A.9 (Lipschitzness). A function f : W → R is ρ-Lipschitz over the set W if for every w, w′ ∈ W we have that |f(w) − f(w′)| ≤ ρ∥w − w′∥.

The following lemma gives a result on the Lipschitz continuity of a function defined as the maximum of other Lipschitz continuous functions.

Lemma A.10. Assume that the functions fi : W → R, i ∈ [m], are Lipschitz continuous with constant ρ. Then the function f(w) = max_{i∈[m]} fi(w) is Lipschitz continuous with constant ρ, that is, |f(w) − f(w′)| ≤ ρ∥w − w′∥ for any w, w′ ∈ W.

Proof. First, we rewrite f(w) = max_{i∈[m]} fi(w) as f(w) = max_{α∈∆m} ∑_{i=1}^m αi fi(w), where ∆m is the m-simplex, that is, ∆m = {α ∈ R^m_+ : ∑_{i=1}^m αi = 1}.
Then, we have |f (w) − f (w′ )| = ≤ ≤ max m ∑ α∈∆m i=1 m ∑ max α∈∆m i=1 m ∑ max α∈∆m i=1 αi fi (w) − max αi fi (w) − m ∑ α∈∆m i=1 m ∑ αi fi (w′ ) αi fi (w′ ) i=1 αi fi (w) − fi (w′ ) ≤ ρ∥w − w′ ∥, 291 where the last inequality follows from the Lipschitz continuity of fi (w), i ∈ [m]. Definition A.11 (Smoothness). A differentiable function f : W → R is said to be β-smooth with respect to a norm ∥ · ∥, if it holds that f (w) ≤ f (w′ ) + ⟨∇f (w′ ), w − w′ ⟩ + β ∥w − w′ ∥2 , ∀ w, w′ ∈ W. 2 (A.1) The following result is an important property of smooth functions which has been utilized in the proof of few results in the thesis. Lemma A.12 (Self-bounding Property). For any β-smooth non-negative function f : R → √ R, we have |f ′ (w)| ≤ 4βf (w) As a simple proof, first from the smoothness assumption, by setting w1 = w2 − β1 f ′ (w2 ) in 1 |f ′ (w )|2 . On the other hand, (A.1) and rearranging the terms we obtain f (w2 )−f (w1 ) ≥ 2β 2 from the convexity of loss function we have f (w1 ) ≥ f ′ (w2 ) + ⟨f ′ (w1 ), w1 − w2 ⟩. Combining these inequalities and considering the fact that the function is non-negative gives the desired inequality. Definition A.13 (Strong Convexity). A function f (w) is said to be α-strongly convex w.r.t a norm ∥ · ∥, if there exists a constant α > 0 (often called the modulus of strong convexity) such that, for any λ ∈ [0, 1] and for all w, w′ ∈ W, it holds that 1 f (λw + (1 − λ)w′ ) ≤ αf (w) + (1 − λ)f (w′ ) − λ(1 − λ)α∥w − w′ ∥2 . 2 If f (w) is twice differentiable, then an equivalent definition of strong convexity is ∇2 f (w) ⪰ αI which indicates that the smallest eigenvalue of the Hessian of f (w) is uniformly lower 292 bounded by α everywhere. When f (w) is differentiable, the strong convexity is equivalent to f (w) ≥ f (w′ ) + ⟨∇f (w′ ), w − w′ ⟩ + α ∥w − w′ ∥2 , ∀ w, w′ ∈ W. 2 An important property of strongly convex functions that we used in the proof of few results in the following: Lemma A.14. Let f (w) be a α-strongly convex function over the domain W, and w∗ = arg minw∈W f (w). Then, for any w ∈ W, we have f (w) − f (w∗ ) ≥ α ∥w − w∗ ∥2 . 2 (A.2) Definition A.15 (Dual Norm). Let ∥ · ∥ be any norm on Rd . Its dual norm denoted by ∥ · ∥∗ is defined by ∥w′ ∥∗ = sup ⟨w′ , w⟩ − ∥w∥ w∈Rd An equivalent definition is ∥w′ ∥∗ = supw∈Rd {⟨w, w′ ⟩|∥w∥ ≤ 1}. If p, q ∈ [1, ∞] satisfy 1/p + 1/q = 1, then the ℓp and ℓq norms are dual to each other. Definition A.16 (Fenchel Conjugate 1 ). The Fenchel conjugate of a function f : W → R is defined as f ∗ (v) = sup ⟨w, v⟩ − f (w). w∈W An immediate result of above definition is the Fenchel-Young inequality stating that for any w and v we have that ⟨w, v⟩ ≤ f (w) + f ∗ (v). Definition A.17 (Bregman Divergence). Let Φ : W → R be a continuously-differentiable 1 Also known as the convex conjugate or Legendre-Fenchel transformation 293 real-valued and strictly convex function defined on a closed convex set W. The Bregman divergence between w and w′ is the difference at w between f and a linear approximation around w′ BΦ (w, w′ ) = Φ(w) − Φ(w′ ) − ⟨w − w′ , ∇Φ(w′ )⟩. For example the squared Euclidean distance B(w, w′ ) = ∥w − w′ ∥2 is the canonical example of a Bregman distance, generated by the convex function Φ(w) = ∥w∥2 . The entropy function Φ(p) = ∑ divergence as BKL (p, q) = A.2 i pi log pi − ∑ ∑ pi gives rises to the generalized Kullback-Leibler ∑ ∑ p pi log qi − pi + qi . i Concentration Inequalities This appendix is a quick excursion into concentration of measure. 
We consider some important concentration results used in the proofs given in this thesis. Concentration inequalities give probability bounds for a random variable to be concentrated around its mean (e.g., see [49] , [30] and [31] for a through discussion and derivation of these inequalities). Lemma A.18 (Markov’s Inequality). For a non-negative random variable X and t > 0, P[X ≥ t] ≤ E[X] t The upside of Markov’s inequality is that it does not need almost no assumptions about the random variable, but the downside is that it only gives very weak bounds. Lemma A.19 (Hoeffding’s Lemma). Let X be any real-valued random variable with expected 294 value E[X] = 0 and such that X ∈ [a, b] almost surely. Then, for all λ ∈ R, [ E eλX ] ) ( 2 λ (b − a)2 ≤ exp . 8 Theorem A.20 (Hoeffding’s Inequality). Let X1 , · · · , Xn be a sequence of i.i.d random variables. Assume Xi ∈ [ai , bi ] and let S = X1 + X2 + · · · + Xn . Then, ( P(S − E[S] ≥ t) ≤ exp − ∑n 2t2 2 i=1 (bi − ai ) ( P(|S − E[S]| ≥ t) ≤ 2 exp − ∑n ) 2t2 2 i=1 (bi − ai ) , ) . The following result extends Hoeffding’s inequality to more general functions f (x1 , x2 , · · · , xn ). Theorem A.21 (McDiarmid’s Inequality). Let X1 , · · · , Xn be independent real-valued random variables. Suppose that sup x1 ,x2 ,··· ,xn ,x′i f (x1 , · · · , xi−1 , xi , xi+1 , · · · , xn ) − f (x1 , · · · , xi−1 , x′i , xi+1 , · · · , xn ) ≤ ci , for i = 1, 2, · · · , n. Then, ( 2t2 ) P (|f (X1 , X2 , · · · , Xn ) − E [f (X1 , X2 , · · · , Xn )]| ≥ t) ≤ 2 exp − ∑n 2 i=1 ci Hoeffding’s inequality does not use any information about the random variables except 295 the fact that they are bounded. If the variance of Xi , i ∈ [n] is small, then we can get a sharper inequality from Bennett’s inequality and its simplified version in Bernstein’s Inequality. Theorem A.22 (Bennett’s Inequality). Let X1 , · · · , Xn be a sequence of i.i.d random variables. Assume E[Xi ] = 0, E[Xi2 ] = σ 2 , and |Xi | ≤ M, i ∈ [n]. Then, P ( n ∑ ) Xi ≥ t ( ≤ exp i=1 −nσ 2 ϕ M2 ( tM nσ 2 )) , where ϕ(x) = (1 + x) log(1 + x) − x. Theorem A.23 (Bernstein’s Inequality). Let X1 , . . . , Xn be independent zero-mean E[Xi ] = 0, i ∈ [n] random variables. Suppose that |Xi | ≤ M almost surely, for all i ∈ [n]. Then, for all positive t, P ( n ∑ i=1 ) Xi > t   t2 ) . ≤ exp − (∑ [ ] 1 2 2 E Xj + 3 M t The next result is an extension of Hoeffding’s inequality to martingales with zero-mean and bounded increments which is due to [10]. Before stating the Hoeffding-Azuma’s inequality, we need few definitions. A sequence of random variables (Xi )i∈N on a probability space (Ω, A, P) is a martingale difference sequence with respect to a filtration (Fi )i∈N if and only if, or all i ≥ 1, each random variable Xi is Fi -measurable and satisfies, E[Xi |Fi−1 ] = 0. Theorem A.24 (Hoeffding-Azuma Inequality). Let X1 , X2 , . . . be a martingale difference sequence with respect to a filtration F = (Fi )i∈N . Assume that for all i ≥ 1, there exists a 296 Fi−1 -measurable random variable Ai and a non-negative constant ci such that Xi ∈ [Ai , Ai + ci ]. Then, the martingale Sn defined by Sn = ∑n i=1 Xi , satisfies for any t > 0, ( 2t2 P[Sn > t] ≤ exp − ∑n 2 i=1 ci ) . The Hoeffding-Azuma’s inequality indicates that Sn is sub-gaussian with variance i.e., ∑n 2 i=1 ci /4, ( ) ] −2t2 P max Si ≥ t ≤ exp ∑n 2 . 1≤i≤n i=1 ci [ The following inequality which is due to Freedman [57] extends Bernstein’s result to the case of discrete-time martingales with bounded jumps (a.k.a Freedman’s inequality). 
This result demonstrates that a martingale exhibits normal-type concentration near its mean value on a scale determined by the predictable quadratic variation of the sequence. Theorem A.25 (Bernstein’s Inequality for Martingales). Let X1 , . . . , Xn be a bounded martingale difference sequence with respect to the filtration F = (Fi )1≤i≤n and with |Xi | ≤ M . Let Si = i ∑ Xj j=1 be the associated martingale. Denote the sum of the conditional variances by Σ2n = n [ ] ∑ E Xi2 |Fi−1 , i=1 Then for all constants t, ν > 0, ] [ P max Si > t and Σ2n ≤ ν i=1,...,n 297 ( ≤ exp − t2 2(ν + M t/3) ) , and therefore, [ P A.3 max Si > i=1,...,n √ ] √ 2 2νt + M t and Σ2n ≤ ν ≤ e−t . 3 Miscellanea A.3.1 Extra-gradient Lemma Lemma A.26. Let Φ(z) be a α-strongly convex function with respect to the norm ∥·∥, whose dual norm is denoted by ∥ · ∥∗ , and B(w, z) = Φ(w) − (Φ(z) + (w − z)⊤ Φ′ (z)) be the Bregman distance induced by function Φ(w). Let Z be a convex compact set, and U ⊆ Z be convex and closed. Let z ∈ Z, γ > 0, Consider the points, w = arg min γu⊤ ξ + B(u, z), (A.3) z+ = arg min γu⊤ ζ + B(u, z), (A.4) u∈U u∈U then for any u ∈ U , we have γζ ⊤ (w − u) ≤ B(u, z) − B(u, z+ ) + γ2 α ∥ξ − ζ∥2∗ − [∥w − z∥2 + ∥w − z+ ∥2 ]. α 2 Proof. By using the definition of Bregman distance B(u, z), we can write two updates in (A.3) as 298 w = arg min u⊤ (γξ − Φ′ (z)) + Φ(u), u∈U (A.5) z+ = arg min u⊤ (γζ − Φ′ (z)) + Φ(u), u∈U by the first oder optimality condition, we have (u − w)⊤ (γξ − Φ′ (z) + Φ′ (w)) ≥ 0, ∀u ∈ U, (A.6) (u − z+ )⊤ (γζ − Φ′ (z) + Φ′ (z+ )) ≥ 0, ∀u ∈ U Applying (A.6) with u = z+ and u = w, we get γ(w − z+ )⊤ ξ ≤ (Φ′ (z) − Φ′ (w))⊤ (w − z+ ), (A.7) γ(z+ − w)⊤ ζ ≤ (Φ′ (z) − Φ′ (z+ ))⊤ (z+ − w). Summing up the two inequalities, we have γ(w − z+ )⊤ (ξ − ζ) ≤ (Φ′ (z+ ) − Φ′ (w))⊤ (w − z+ ). (A.8) Then γ∥ξ − ζ∥∗ ∥w − z+ ∥ ≥ −γ(w − z+ )⊤ (ξ − ζ) ≥ (Φ′ (z+ ) − Φ′ (w))⊤ (z+ − w) ≥ α∥z+ − w∥2 . where in the last inequality, we use the strong convexity of Φ(w). 299 (A.9) B(u, z) − B(u, z+ ) = Φ(z+ ) − Φ(z) + (u − z+ )⊤ Φ′ (z+ ) − (u − z)⊤ Φ′ (z) =Φ(z+ ) − Φ(z) + (u − z+ )⊤ Φ′ (z+ ) − (u − z+ )⊤ Φ′ (z) − (z+ − z)⊤ Φ′ (z) =Φ(z+ ) − Φ(z) − (z+ − z)⊤ Φ′ (z) + (u − z+ )⊤ (Φ′ (z+ ) − Φ′ (z)) =Φ(z+ ) − Φ(z) − (z+ − z)⊤ Φ′ (z) + (u − z+ )⊤ (γζ + Φ′ (z+ ) − Φ′ (z)) − (u − z+ )⊤ γζ ≥Φ(z+ ) − Φ(z) − (z+ − z)⊤ Φ′ (z) − (u − z+ )⊤ γζ = Φ(z+ ) − Φ(z) − (z+ − z)⊤ Φ′ (z) − (w − z+ )⊤ γζ +(w − u)⊤ γζ, ϵ (A.10) where the inequality follows from (A.6). We proceed by bounding ϵ as: ϵ =Φ(z+ ) − Φ(z) − (z+ − z)⊤ Φ′ (z) − (w − z+ )⊤ γζ =Φ(z+ ) − Φ(z) − (z+ − z)⊤ Φ′ (z) − (w − z+ )⊤ γ(ζ − ξ) − (w − z+ )⊤ γξ =Φ(z+ ) − Φ(z) − (z+ − z)⊤ Φ′ (z) − (w − z+ )⊤ γ(ζ − ξ) + (z+ − w)⊤ (γξ − Φ′ (z) + Φ′ (w)) − (z+ − w)⊤ (Φ′ (w) − Φ′ (z)) ≥Φ(z+ ) − Φ(z) − (z+ − z)⊤ Φ′ (z) − (w − z+ )⊤ γ(ζ − ξ) − (z+ − w)⊤ (Φ′ (w) − Φ′ (z)) =Φ(z+ ) − Φ(z) − (w − z)⊤ Φ′ (z) − (w − z+ )⊤ γ(ζ − ξ) − (z+ − w)⊤ Φ′ (w) [ ] [ ] ⊤ ′ ⊤ ′ = Φ(z+ ) − Φ(w) − (z+ − w) Φ (w) + Φ(w) − Φ(z) − (w − z) Φ (z) −(w − z+ )⊤ γ(ζ − ξ) α α ≥ ∥w − z+ ∥2 + ∥w − z∥2 − γ∥w − z+ ∥∥ζ − ξ∥∗ 2 2 α γ2 ≥ {∥w − z+ ∥2 + ∥w − z∥2 } − ∥ζ − ξ∥2∗ , 2 α 300 (A.11) where the first inequality follows from (A.6), the second inequality follows from the strong convexity of Φ(w), and the last inequality follows from (A.10). Combining the above results, we have γ(w − u)⊤ ζ ≤ B(u, z) − B(u, z+ ) + A.3.2 γ2 α ∥ζ − ξ∥2∗ − {∥w − z+ ∥2 + ∥w − z∥2 }. α 2 (A.12) Stochastic Mirror Descent for Smooth Losses In this subsection we provide a tight analysis of stochastic mirror decent algorithm for smooth losses. 
We will show that the convergence rate of the mirror descent algorithm [119] for stochastic optimization of both non-smooth and smooth functions is dominated by an √ O(1/ T ) convergence rate. We consider the nonlinear projected subgradient formulation of the mirror descent from [17] which iteratively updates the solution by: { } 1 wt+1 = arg min ⟨w, gt ⟩ + BΦ (w, wt ) , ηt w∈W (A.13) where gt is a (sub)gradient of f (w) at point wt and BΦ (·, ·) is the Bregmen diveregnce defined with strongly convex function Φ(·). Define the average solution as wT = ∑T t=1 wt /T and let g(w, ξ) denote a stochastic gradient of function at point w, i.e., E[g(w, ξ)] = ∇f (w). We state few standard assumptions we make in our analysis about the stochastic optimization problem. A1. (Lipschitz Continuity) E[∥g(w, ξ)∥2∗ ] ≤ G2 301 A2. (Bounded Variance) E[∥∇f (w) − g(w, ξ)∥] ≤ σ 2 . A3. (Smoothness) The function f has L Lipschitz continuous gradient. A4. (Compactness) BΦ (w∗ , w) ≤ R2 . The proof is extensively uses the following key inequalities: (1) Optimality condition ⟨w − wt+1 , gt + 1 1 ∇Φ(wt+1 ) − ∇Φ(wt )⟩ ≥ 0 ηt ηt (2) Generalized inequality BΦ (w∗ , wt ) − BΦ (w∗ , wt+1 ) − BΦ (wt+1 , wt ) = ⟨w∗ − wt+1 , ∇Φ(wt+1 ) − ∇Φ(wt )⟩ (3) Fenchel-Yang inequality applied to the conjugate pair 12 ∥ · ∥2 and 12 ∥ · ∥2∗ yields ⟨gt , wt − wt+1 ⟩ ≤ 1 ηt ∥gt ∥2∗ + ∥wt − wt+1 ∥2 2 2ηt (4) 1 BΦ (w, y) ≥ ∥w − y∥2 2 (5) From smoothness assumption we have f (w′ ) − f (w) − ⟨∇f (w), w′ − w⟩ ≤ L ′ ∥w − w∥2 2 The analysis of the convergence rate of basic mirror descent algorithm follows standard techniques in convex optimization and for completeness we provide a proof here following [17]. 302 Lemma A.27. We have f (wt ) − f (w∗ ) ≤ 1 ηt ||gt ||2∗ + [BΦ (w∗ , wt ) − BΦ (w∗ , wt+1 )] 2 ηt Proof. From the convexity of f (w) and by setting ζt = ∇f (wt ) − gt we have f (wt ) − f (w∗ ) ≤ ⟨∇f (wt ), wt − w∗ ⟩ ≤ ⟨gt , wt − w∗ ⟩ + ⟨∇f (wt ) − gt , wt − w∗ ⟩ (1) ≤ ⟨gt , wt − w∗ ⟩ + ⟨w∗ − wt+1 , gt + = ⟨gt , wt − wt+1 ⟩ + (2) η t ≤ 2 ∥gt ∥2∗ + 1 1 ∇Φ(wt+1 ) − ∇Φ(wt )⟩ + ⟨ζt , wt − w∗ ⟩ ηt ηt 1 ⟨w∗ − wt+1 , ∇Φ(wt+1 ) − ∇Φ(wt )⟩ + ⟨ζt , wt − w∗ ⟩ ηt 1 1 ∥wt − wt+1 ∥2 + [BΦ (w∗ , wt ) − BΦ (w∗ , wt+1 ) 2ηt ηt − BΦ (wt+1 , wt )] + ⟨ζt , wt − w∗ ⟩ (3) η t ≤ 2 ∥gt ∥2∗ + 1 [B (w∗ , wt ) − BΦ (w∗ , wt+1 )] + ⟨ζt , wt − w∗ ⟩ ηt Φ Theorem A.28. By setting leaning rate as ηt = R √ we have the following rate for the G t convergence of stochastic mirror descent for non-smooth objectives: RG E[f (wT )] − f (w∗ ) ≤ O( √ ) T 303 Proof. Summing up Lemma 1 for all iterations we get T ∑ t=1 ∑ 1 1 1 f (wt ) − f (w∗ ) ≤ BΦ (w∗ , w1 ) + BΦ (w∗ , wt )( − ) η1 ηt ηt−1 T t=2 + T ∑ ηt t=1 ≤ 2 ∥gt ∥2∗ + T ∑ ⟨ζt , wt − w∗ ⟩ t=1 T R 2 G2 ∑ + ηt ηT 2 t=1 √ = O(RG T ), where the second inequality follows since each wt is a deterministic function of ξ1 , · · · , ξt−1 , so E[⟨∇f (wt ) − gt , wt − w∗ ⟩|ξ1 , · · · , ξt−1 ] = 0 We generalize the proof for smooth functions. √ 1 Theorem A.29. By setting ηt = L+β where βt = (σ/R) t, the following holds for the t convergence rate of stochastic mirror descent for smooth functions: E[f (wT )] − f (w∗ ) ≤ O( LR2 σR +√ ) T T Proof. 
f (wt ) − f (w∗ ) ≤ ⟨∇f (wt ), wt − w∗ ⟩ ≤ ⟨∇f (wt ), wt − wt+1 ⟩ + ⟨∇f (wt ), wt+1 − w∗ ⟩ (5) ≤ f (wt ) − f (wt+1 ) + L ∥wt+1 − wt ∥2 + ⟨∇f (wt ), wt+1 − w∗ ⟩ 2 304 By rearranging the terms we proceed as follows: f (wt+1 ) − f (w∗ ) ≤ ⟨∇f (wt ), wt+1 − w∗ ⟩ + L ∥wt+1 − wt ∥2 2 L ∥wt+1 − wt ∥2 + ⟨ζt , wt+1 − w∗ ⟩ 2 (1) 1 L ≤ ⟨w∗ − wt+1 , ∇Φ(wt+1 ) − ∇Φ(wt )⟩ + ∥wt+1 − wt ∥2 + ⟨ζt , wt+1 − w∗ ⟩ ηt 2 ≤ ⟨gt , wt+1 − w∗ ⟩ + (2) 1 ≤ ηt (4) 1 ≤ ηt [BΦ (w∗ , wt ) − BΦ (w∗ , wt+1 ) − BΦ (wt , wt+1 )] + L ∥wt+1 − wt ∥2 + ⟨ζt , wt+1 − w∗ ⟩ 2 [BΦ (w∗ , wt ) − BΦ (w∗ , wt+1 )] − (L + βt )BΦ (wt , wt+1 ) + LBΦ (wt , wt+1 ) + ⟨ζt , wt+1 − w∗ ⟩ = 1 [B (w∗ , wt ) − BΦ (w∗ , wt+1 )] − βt BΦ (wt , wt+1 ) + ⟨ζt , wt+1 − wt ⟩ + ⟨ζt , wt − w∗ ⟩ ηt Φ (3) 1 ≤ ηt [BΦ (w∗ , wt ) − BΦ (w∗ , wt+1 )] + 1 ∥ζt ∥2∗ + ⟨ζt , wt − w∗ ⟩ 2βt Summing for all t yields T ∑ f (wt+1 ) − f (w∗ ) t=1 ∑ ∑ ∥ζt ∥2 ∑ 1 1 1 ∗ ≤ BΦ (w∗ , w1 ) + BΦ (w∗ − wt )( − )+ + ⟨ζt , wt − w∗ ⟩ η1 ηt ηt−1 2βt T T T t=2 t=1 t=1 T T ∑ R2 1 1 σ2 ∑ 1 2 ( − ≤ +R )+ +0 η1 ηt ηt−1 2 βt t=2 ≤ R2 ηT + t=1 √ σR √ T = O(LR2 + σR T ), 2 where in the second in equality we used the fact E[⟨∇f (wt )−gt , wt −w∗ ⟩|ξ1 , · · · , ξt−1 ] = 0. By averaging the solutions over all iterations and applying the Jensen’s inequality we obtain the convergence rate claimed in the theorem. 305 306 BIBLIOGRAPHY 307 BIBLIOGRAPHY [1] Jacob Abernethy, Peter L Bartlett, Alexander Rakhlin, and Ambuj Tewari. Optimal strategies and minimax lower bounds for online convex games. In Proceedings of the Nineteenth Annual Conference on Computational Learning Theory, 2008. [2] Jacob Abernethy, Elad Hazan, and Alexander Rakhlin. Competing in the dark: An efficient algorithm for bandit linear optimization. In COLT, pages 263–274, 2008. [3] Jacob Abernethy, Elad Hazan, and Alexander Rakhlin. Interior-point methods for fullinformation and bandit online learning. IEEE Transactions on Information Theory, 58(7):4164–4175, 2012. [4] Alekh Agarwal, Peter L. Bartlett, Pradeep D. Ravikumar, and Martin J. Wainwright. Information-theoretic lower bounds on the oracle complexity of stochastic convex optimization. IEEE Transactions on Information Theory, 58(5):3235–3249, 2012. [5] Alekh Agarwal, Ofer Dekel, and Lin Xiao. Optimal algorithms for online convex optimization with multi-point bandit feedback. In Proceedings of The 23rd Conference on Learning Theory (COLT), pages 28–40, 2010. [6] Amit Agarwal, Elad Hazan, Satyen Kale, and Robert E. Schapire. Algorithms for portfolio management based on the newton method. In Proceedings of the 23rd international conference on Machine learning, pages 9–16, 2006. [7] M. Anthony and P.L. Bartlett. Neural network learning: Theoretical foundations. Cambridge University Press, 1999. [8] Sanjeev Arora, L´aszl´o Babai, Jacques Stern, and Z Sweedyk. The hardness of approximate optima in lattices, codes, and systems of linear equations. In Foundations of Computer Science, 1993. Proceedings., 34th Annual Symposium on, pages 724–733. IEEE, 1993. [9] Baruch Awerbuch and Robert D. Kleinberg. Adaptive routing with end-to-end feedback: distributed learning and geometric approaches. In Proceedings of the thirty-sixth annual ACM symposium on Theory of computing, STOC ’04, pages 45–53, New York, NY, USA, 2004. ACM. 308 [10] Kazuoki Azuma et al. Weighted sums of certain dependent random variables. Tohoku Mathematical Journal, 19(3):357–367, 1967. [11] K. Bache and M. Lichman. UCI machine learning repository, 2013. [12] Maria-Florina Balcan, Steve Hanneke, and Jennifer Wortman Vaughan. 
The true sample complexity of active learning. Machine Learning, 80(2-3):111–139, 2010. [13] Peter L Bartlett, O. Bousquet, and S. Mendelson. Local rademacher complexities. The Annals of Statistics, 33(4):1497–1537, 2005. [14] Peter L Bartlett, Michael I Jordan, and Jon D McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138–156, 2006. [15] Peter L Bartlett and Shahar Mendelson. Rademacher and gaussian complexities: Risk bounds and structural results. The Journal of Machine Learning Research, 3:463–482, 2003. [16] Jonathan Barzilai and Jonathan M Borwein. Two-point step size gradient methods. IMA Journal of Numerical Analysis, 8(1):141–148, 1988. [17] Amir Beck and Marc Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization. Oper. Res. Lett., 31(3):167–175, 2003. [18] Robert Bell and Thomas M Cover. Game-theoretic optimal portfolios. Management Science, 34(6):724–733, 1988. [19] Shai Ben-David, David Loker, Nathan Srebro, and Karthik Sridharan. Minimizing the misclassification error rate using a surrogate convex loss. In ICML, 2012. [20] Shai Ben-David, D´avid P´al, and Shai Shalev-Shwartz. Agnostic online learning. In COLT, 2009. [21] Dimitri P. Bertsekas. Nonlinear Programming. Athena Scientific, 2nd edition, 1999. [22] Dimitri P. Bertsekas, Angelia Nedic, and Asuman E. Ozdaglar. Convex Analysis and Optimization. Athena Scientific, 2003. [23] David Blackwell. An analog of the minimax theorem for vector payoffs. Pacific Journal of Mathematics, 1(6), 1956. 309 [24] David Blackwell. Minimax vs. bayes prediction. Probability in the Engineering and Informational Sciences, 9(01):53–58, 1995. [25] HD Block. The perceptron: A model for brain functioning. i. Reviews of Modern Physics, 34(1):123, 1962. [26] Anselm Blumer, A. Ehrenfeucht, David Haussler, and Manfred K. Warmuth. Learnability and the vapnik-chervonenkis dimension. J. ACM, 36(4):929–965, 1989. [27] Allan Borodin and Ran El-Yaniv. Online computation and competitive analysis. Cambridge University Press, 1998. [28] Jonathan M Borwein and Adrian S Lewis. Convex analysis and nonlinear optimization: theory and examples, volume 3. Springer, 2010. [29] Leon Bottou and Olivier Bousquet. The tradeoffs of large scale learning. In NIPS, pages 161–168, 2008. [30] St´ephane Boucheron, G´abor Lugosi, and Olivier Bousquet. Concentration inequalities. In Advanced Lectures on Machine Learning, pages 208–240. Springer, 2004. [31] St´ephane Boucheron, G´abor Lugosi, and Pascal Massart. Concentration inequalities: A nonasymptotic theory of independence. Oxford University Press, 2013. [32] Olivier Bousquet and Andr´e Elisseeff. Stability and generalization. The Journal of Machine Learning Research, 2:499–526, 2002. [33] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004. [34] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004. [35] Augustin Cauchy. M´ethode g´en´erale pour la r´esolution des systemes d´equations simultan´ees. Comp. Rend. Sci. Paris, 25(1847):536–538, 1847. [36] Nicol`o Cesa-Bianchi, Alex Conconi, and Claudio Gentile. On the generalization ability of on-line learning algorithms. IEEE Transactions on Information Theory, 50(9):2050– 2057, 2004. 310 [37] Nicolo Cesa-Bianchi, Yoav Freund, David Haussler, David P Helmbold, Robert E Schapire, and Manfred K Warmuth. How to use expert advice. Journal of the ACM (JACM), 44(3):427–485, 1997. 
[38] Nicolo Cesa-Bianchi and G´abor Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006. [39] Kamalika Chaudhuri, Yoav Freund, and Daniel J Hsu. A parameter-free hedging algorithm. In Advances in neural information processing systems, pages 297–305, 2009. [40] Chao-Kai Chiang, Chia-Jung Lee, and Chi-Jen Lu. Beating bandits in gradually evolving worlds. In COLT, pages 210–227, 2013. [41] Chao-Kai Chiang, Tianbao Yang, Chia-Jung Lee, Mehrdad Mahdavi, Chi-Jen Lu, Rong Jin, and Shenghuo Zhu. Online optimization with gradual variations. In Conference on Learning Theory, 2012. [42] Kenneth L. Clarkson. Coresets, sparse greedy approximation, and the frank-wolfe algorithm. ACM Transactions on Algorithms, 6(4), 2010. [43] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine learning, 20(3):273–297, 1995. [44] Andrew Cotter, Ohad Shamir, Nati Srebro, and Karthik Sridharan. Better mini-batch algorithms via accelerated gradient methods. In NIPS, pages 1647–1655, 2011. [45] Thomas M. Cover. Behavior of sequential predictors of binary sequences. Transactions of the Prague Conferences on Information Theory, 1965. [46] Varsha Dani, Thomas P. Hayes, and Sham Kakade. The price of bandit information for online optimization. In NIPS, 2007. [47] Ofer Dekel, Ran Gilad-Bachrach, Ohad Shamir, and Lin Xiao. Optimal distributed online prediction using mini-batches. The Journal of Machine Learning Research, 13:165–202, 2012. [48] Ofer Dekel and Yoram Singer. Data-driven online to batch conversions. In NIPS, 2005. [49] Devdatt P Dubhashi and Alessandro Panconesi. Concentration of measure for the analysis of randomized algorithms. Cambridge University Press, 2009. 311 [50] John C. Duchi, Peter L. Bartlett, and Martin J. Wainwright. Randomized smoothing for stochastic optimization. SIAM Journal on Optimization, 22(2):674–701, 2012. [51] John C. Duchi, Shai Shalev-Shwartz, Yoram Singer, and Tushar Chandra. Efficient projections onto the l1-ball for learning in high dimensions. In ICML, pages 272–279, 2008. [52] A. Ehrenfeucht, D. Haussler, M. Kearns, and L. Valiant. A general lower bound on the number of examples needed for learning. Information and Computation, 82(3):247–261, 1989. [53] Tim V Erven, Wouter M Koolen, Steven D Rooij, and Peter Gr¨ unwald. Adaptive hedge. In Advances in Neural Information Processing Systems, pages 1656–1664, 2011. [54] Abraham D. Flaxman, Adam Tauman Kalai, and H. Brendan McMahan. Online convex optimization in the bandit setting: gradient descent without a gradient. In SODA, pages 385–394, 2005. [55] Sally Floyd and Manfred Warmuth. Sample compression, learnability, and the vapnikchervonenkis dimension. Machine Learning, 21(3):269–304, 1995. [56] M. Frank and P. Wolfe. An algorithm for quadratic programming. Naval Research Logistics, 3, 1956. [57] David A Freedman. On tail probabilities for martingales. the Annals of Probability, pages 100–118, 1975. [58] Yoav Freund and Robert E Schapire. A desicion-theoretic generalization of on-line learning and an application to boosting. In Computational learning theory, pages 23– 37. Springer, 1995. [59] Yoav Freund and Robert E Schapire. Adaptive game playing using multiplicative weights. Games and Economic Behavior, 29(1):79–103, 1999. [60] Michael P Friedlander and Mark Schmidt. Hybrid deterministic-stochastic methods for data fitting. SIAM Journal on Scientific Computing, 34(3):A1380–A1405, 2012. [61] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. 
Additive logistic regression: a statistical view of boosting (with discussion and a rejoinder by the authors). The annals of statistics, 28(2):337–407, 2000. 312 [62] Saeed Ghadimi and Guanghui Lan. Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization i: a generic algorithmic framework. SIAM Journal on Optimization, 22(4):1469–1492, 2012. [63] Geoffrey J Gordon. Regret bounds for prediction problems. In Proceedings of the twelfth annual conference on Computational learning theory, pages 29–40. ACM, 1999. [64] John Greenstadt. On the relative efficiencies of gradient methods. Mathematics of Computation, pages 360–367, 1967. [65] James Hannan. Approximation to bayes risk in repeated play. Contributions to the Theory of Games, 3:97–139, 1957. [66] Steve Hanneke. Theoretical Foundations of Active Learning. PhD thesis, 2009. [67] Elad Hazan. Sparse approximate solutions to semidefinite programs. In LATIN, pages 306–316, 2008. [68] Elad Hazan, Amit Agarwal, and Satyen Kale. Logarithmic regret algorithms for online convex optimization. Machine Learning, 69(2-3):169–192, 2007. [69] Elad Hazan and Satyen Kale. Extracting certainty from uncertainty: Regret bounded by variation in costs. In COLT, pages 57–68, 2008. [70] Elad Hazan and Satyen Kale. Better algorithms for benign bandits. In SODA, pages 38–47, 2009. [71] Elad Hazan and Satyen Kale. Extracting certainty from uncertainty: Regret bounded by variation in costs. Machine learning, 80(2-3):165–188, 2010. [72] Elad Hazan and Satyen Kale. Beyond the regret minimization barrier: an optimal algorithm for stochastic strongly-convex optimization. COLT, 19:421–436, 2011. [73] Elad Hazan and Satyen Kale. Projection-free online learning. In ICML, 2012. [74] Elad Hazan and Tomer Koren. Optimal algorithms for ridge and lasso regression with partially observed attributes. CoRR, 2011. [75] Klaus-Uwe Hoffgen, Hans-Ulrich Simon, and Kevin S Vanhorn. Robust trainability of single neurons. Journal of Computer and System Sciences, 50(1):114–125, 1995. 313 [76] Anatoli Iouditski and Yuri Nesterov. Primal-dual subgradient methods for minimizing uniformly convex functions. available at http://hal.archivesouvertes.fr/docs/00/50/89/33/PDF/Strong-hal.pdf, 2010. [77] Martin Jaggi. Sparse Convex Optimization Methods for Machine Learning. PhD thesis, ETH Zurich, October 2011. [78] Martin Jaggi and Marek Sulovsk´ y. A simple algorithm for nuclear norm regularized problems. In ICML, pages 471–478, 2010. [79] Kevin G Jamieson, Robert D Nowak, and Benjamin Recht. Query complexity of derivative-free optimization. In NIPS, 2012. [80] Wenxin Jiang. Process consistency for adaboost. The Annals of Statistics, 32(1):13–29, 2004. [81] Anatoli Juditsky and Yuri Nesterov. Primal-dual subgradient methods for minimizing uniformly convex functions. Technical report, 2010. [82] Sham M. Kakade, Karthik Sridharan, and Ambuj Tewari. On the complexity of linear prediction: Risk bounds, margin bounds, and regularization. In NIPS, pages 793–800, 2008. [83] Sham M. Kakade and Ambuj Tewari. On the generalization ability of online strongly convex programming algorithms. In Advances in Neural Information Processing Systems 21, pages 801–808, 2008. [84] Adam Tauman Kalai, Adam R Klivans, Yishay Mansour, and Rocco A Servedio. Agnostically learning halfspaces. SIAM Journal on Computing, 37(6):1777–1805, 2008. [85] Adam Tauman Kalai and Ravi Sastry. The isotron algorithm: High-dimensional isotonic regression. In COLT, 2009. 
[86] Adam Tauman Kalai and Santosh Vempala. Efficient algorithms for online decision problems. J. Comput. Syst. Sci., 71:291–307, October 2005. [87] Michael J Kearns, Robert E Schapire, and Linda M Sellie. Toward efficient agnostic learning. Machine Learning, 17(2-3):115–141, 1994. [88] J. Kivinen, A. J. Smola, and R. C. Williamson. Online Learning with Kernels. IEEE Transactions on Signal Processing, 52:2165–2176, 2004. 314 [89] Jyrki Kivinen and Manfred K. Warmuth. Additive versus exponentiated gradient updates for linear prediction. In Proceedings of the 27th annual ACM symposium on Theory of computing, Proceedings of the Twenty-Seventh Annual ACM Symposium on Theory of Computing, pages 209–218, New York, NY, USA, 1995. ACM. [90] Jyrki Kivinen and Manfred K Warmuth. Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 132(1):1–63, 1997. [91] Vladimir Koltchinskii. Rademacher penalties and structural risk minimization. Information Theory, IEEE Transactions on, 47(5):1902–1914, 2001. [92] Vladimir Koltchinskii. Oracle Inequalities in Empirical Risk Minimization and Sparse Recovery Problems. Lecture Notes in mathematics. Springer, 2011. [93] Guanghui Lan. Efficient methods for stochastic composite optimization. Technical report, 2008. [94] Guanghui Lan. An optimal method for stochastic composite optimization. Mathematical Programming, 133(1-2):365–397, 2012. [95] Wee Sun Lee, Peter L. Bartlett, and Robert C. Williamson. The importance of convexity in learning with squared loss. IEEE Transactions on Information Theory, 44(5):1974–1980, 1998. [96] Qihang Lin, Xi Chen, and Javier Pena. A smoothing stochastic gradient method for composite optimization. arXiv preprint arXiv:1008.5204, 2010. [97] Yi Lin. A note on margin-based loss functions in classification. Statistics & probability letters, 68(1):73–82, 2004. [98] Nick Littlestone. Learning quickly when irrelevant attributes abound: A new linearthreshold algorithm. Machine learning, 2(4):285–318, 1988. [99] Nick Littlestone and Manfred K Warmuth. The weighted majority algorithm. Information and computation, 108(2):212–261, 1994. [100] Jun Liu and Jieping Ye. Efficient euclidean projections in linear time. In ICML, pages 83–90, 2009. [101] G´abor Lugosi and Nicolas Vayatis. On the bayes-risk consistency of regularized boosting methods. Annals of Statistics, pages 30–55, 2004. 315 [102] Mehrdad Mahdavi and Rong Jin. Passive learning with target risk. COLT, 2013. [103] Mehrdad Mahdavi and Rong Jin. Excess risk bounds for exponentially concave losses. arXiv preprint arXiv:1401.4566, 2014. [104] Mehrdad Mahdavi, Rong Jin, and Tianbao Yang. Trading regret for efficiency: Online convex optimization with long term constraints. Journal of Machine Learning Research, 13:2503–2528, 2012. [105] Mehrdad Mahdavi, Tianbao Yang, and Rong Jin. Stochastic convex optimization with multiple objectives. In NIPS, 2013. [106] Mehrdad Mahdavi, Tianbao Yang, Rong Jin, Shenghuo Zhu, and Jinfeng Yi. Stochastic gradient descent with only one projection. In NIPS, pages 503–511, 2012. [107] Mehrdad Mahdavi, Lijun Zhang, and Rong Jin. Mixed optimization for smooth functions. In NIPS, 2013. [108] Mehrdad Mahdavi, Lijun Zhang, and Rong Jin. Binary excess risk for smooth convex surrogates. arXiv preprint arXiv:1402.1792, 2014. [109] Enno Mammen and Alexandre B Tsybakov. Smooth discrimination analysis. The Annals of Statistics, 27(6):1808–1829, 1999. [110] Shie Mannor and John N. Tsitsiklis. Online learning with constraints. 
In COLT, pages 529–543, 2006. [111] Shie Mannor, John N. Tsitsiklis, and Jia Yuan Yu. Online learning with sample path constraints. Journal of Machine Learning Research, 10:569–590, 2009. [112] David A McAllester. Some pac-bayesian theorems. In Proceedings of the eleventh annual Conference on Computational learning theory, pages 230–234. ACM, 1998. [113] H. Brendan McMahan and Matthew J. Streeter. Open problem: Better bounds for online logistic regression. Journal of Machine Learning Research - Proceedings Track, 23:44.1–44.3, 2012. [114] Hariharan Narayanan and Alexander Rakhlin. Random walk approach to regret minimization. In NIPS, pages 1777–1785, 2010. 316 [115] Arkadai Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009. [116] Arkadi Nemirovski. Efficient methods in convex programming. Lecture Notes, Available at http://www2.isye.gatech.edu/ nemirovs, 1994. [117] Arkadi Nemirovski. Prox-method with rate of convergence o(1/t) for variational inequalities with lipschitz continuous monotone operators and smooth convex-concave saddle point problems. SIAM J. on Optimization, 15:229–251, 2005. [118] Arkadi S Nemirovski and Michael J Todd. Interior-point methods for optimization. Acta Numerica, 17:191–234, 2008. [119] Arkadi Nemirovsky and David Borisovich Yudin. Problem complexity and method efficiency in optimization. Wiley- Interscience Series in Discrete Mathematics. John Wiley, 1983. [120] Yurii Nesterov. A method of solving a convex programming problem with convergence rate o (1/k2). In Soviet Mathematics Doklady, volume 27, pages 372–376, 1983. [121] Yurii Nesterov. Introductory lectures on convex optimization: a basic course, volume 87 of Applied optimization. Kluwer Academic Publishers, 2004. [122] Yurii Nesterov. Excessive gap technique in nonsmooth convex minimization. SIAM Journal on Optimization, 16(1):235–249, 2005. [123] Yurii Nesterov. Smooth minimization of non-smooth functions. Mathematical Programming, 103(1):127–152, 2005. [124] Yurii Nesterov. Gradient methods for minimizing composite objective function. Mathematical Programming, 140(1), 2013. [125] Alexander Rakhlin, Ohad Shamir, and Karthik Sridharan. Making gradient descent optimal for strongly convex stochastic optimization. In Proceedings of the 29th International Conference on Machine Learning (ICML), pages 449–456, 2012. [126] Alexander Rakhlin, Karthik Sridharan, and Ambuj Tewari. Online learning: Random averages, combinatorial parameters, and learnability. In NIPS, pages 1984–1992, 2010. 317 [127] Aaditya Ramdas and Aarti Singh. Optimal stochastic convex optimization through the lens of active learning. In ICML, 2013. [128] Nicolas Le Roux, Mark Schmidt, and Francis Bach. A stochastic gradient method with an exponential convergence rate for finite training sets. In Advances in Neural Information Processing Systems 25 (NIPS), pages 2672–2680, 2012. [129] Shai Shalev-Shwartz. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107–194, 2012. [130] Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014. [131] Shai Shalev-Shwartz, Ohad Shamir, Nathan Srebro, and Karthik Sridharan. Learnability and stability in the general learning setting. COLT, 2009. [132] Shai Shalev-Shwartz, Ohad Shamir, Nathan Srebro, and Karthik Sridharan. Stochastic convex optimization. In COLT, 2009. 
[133] Shai Shalev-Shwartz, Ohad Shamir, Nathan Srebro, and Karthik Sridharan. Learnability, stability and uniform convergence. Journal of Machine Learning Research, 11:2635–2670, 2010. [134] Shai Shalev-Shwartz, Yoram Singer, and Nathan Srebro. Pegasos: Primal estimated sub-gradient solver for svm. In ICML, pages 807–814, 2007. [135] Shai Shalev-Shwartz and Tong Zhang. Stochastic dual coordinate ascent methods for regularized loss minimization. JMLR, 14:567599, 2013. [136] Ohad Shamir. On the complexity of bandit and derivative-free stochastic convex optimization. In Conference on Learning Theory, 2013. [137] Ohad Shamir and Tong Zhang. Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes. ICML, 2013. [138] Nathan Srebro, Karthik Sridharan, and Ambuj Tewari. Smoothness, low noise and fast rates. In NIPS, pages 2199–2207, 2010. [139] Karthik Sridharan. Learning from an optimization viewpoint. PhD Thesis, 2012. 318 [140] Karthik Sridharan, Shai Shalev-Shwartz, and Nathan Srebro. Fast rates for regularized objectives. In NIPS, pages 1545–1552, 2008. [141] Ingo Steinwart. Consistency of support vector machines and other regularized kernel classifiers. Information Theory, IEEE Transactions on, 51(1):128–142, 2005. [142] Eiji Takimoto and Manfred K. Warmuth. Path kernels and multiplicative updates. Journal Machine Learnning Research, 4:773–818, December 2003. [143] Paul Tseng. On accelerated proximal gradient methods for convex-concave optimization, 2009. [144] Leslie G Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134– 1142, 1984. [145] Vladimir N Vapnik. Statistical learning theory. 1998. [146] Vladimir N Vapnik and A Ya Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability & Its Applications, 16(2):264–280, 1971. [147] Volodimir G Vovk. Aggregating strategies. In Proc. Third Workshop on Computational Learning Theory, pages 371–383. Morgan Kaufmann, 1990. [148] David H Wolpert. The lack of a priori distinctions between learning algorithms. Neural computation, 8(7):1341–1390, 1996. [149] Qiang Wu and Ding-Xuan Zhou. Svm soft margin classifiers: Linear programming versus quadratic programming. Neural Computation, 17(5):1160–1187, 2005. [150] Tianbao Yang, Mehrdad Mahdavi, Rong Jin, and Shenghuo Zhu. Regret bounded by gradual variation for online convex optimization. Journal of Machine Learning Research-Proceedings Track, 23:6–1, 2012. [151] Yiming Ying and Peng Li. Distance metric learning with eigenvalue optimization. JMLR., 13:1–26, 2012. [152] Lijun Zhang, Mehrdad Mahdavi, and Rong Jin. Linear convergence with condition number independent access of full gradients. In NIPS, 2013. 319 [153] Lijun Zhang, Mehrdad Mahdavi, Rong Jin, Tianbao Yang, and Shenghuo Zhu. Recovering the optimal solution by dual random projection. In COLT, 2013. [154] Lijun Zhang, Tianbao Yang, Rong Jin, and Xiaofei He. O(logt) projections for stochastic optimization of smooth and strongly convex functions. ICML, 2013. [155] Tong Zhang. Sequential greedy approximation for certain convex optimization problems. Information Theory, IEEE Transactions on, 49:682–691, 2003. [156] Tong Zhang. Statistical behavior and consistency of classification methods based on convex risk minimization. Annals of Statistics, pages 56–85, 2004. [157] Ding-Xuan Zhou. Capacity of reproducing kernel spaces in learning theory. Information Theory, IEEE Transactions on, 49(7):1743–1752, 2003. 
[158] Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In ICML, pages 928–936, 2003. 320