# Introduction

There are many algorithms in Online Learning: Gradient Descent, Exponential Weights, Mirror Descent, Follow the Regularized Leader, and more (see e.g. (Shalev-Shwartz 2012)). It is therefore natural to ask how these algorithms relate, and how different they really are. In this post we look at three basic algorithms and show that they are in fact obtained from the same master algorithm. That is, we show that

1. Gradient Descent (tuned for convex loss functions),

2. Gradient Descent (tuned for strongly convex loss functions), and

3. Online Newton Step (for exp-concave loss functions)

are all instances of Exponential Weights with certain so-called surrogate losses. We give a uniform derivation of the algorithm and its regret analysis, and recover the known bounds in each case.

I addressed the first case above in this earlier post, and encountered the connection with the third case while working on (Erven and Koolen 2016). A similar connection between Exponential Weights and the Mirror Descent family of algorithms is studied by (Hoeven 2016).

# Setting: Online Optimisation

We look at online optimisation over a convex set $${{\mathcal U}}\subseteq {{\mathbb R}}^d$$. Learning proceeds in rounds. In round $$t$$ the learner plays a point $${{\boldsymbol w}}_t \in {{\mathcal U}}$$. Then the adversary reveals a loss function $$f_t : {{\mathcal U}}\to {{\mathbb R}}$$, upon which the learner incurs loss $$f_t({{\boldsymbol w}}_t)$$. We evaluate the learner by its so-called regret. After $$T$$ rounds, the regret compared to a point $${{\boldsymbol u}}\in {{\mathcal U}}$$ is defined by $$\label{eq:regret} R_T^{{\boldsymbol u}}~{:=}~ \sum_{t=1}^T {\big(f_t({{\boldsymbol w}}_t) - f_t({{\boldsymbol u}})\big)} .$$ The goal of the learner is to ensure small regret compared to all $${{\boldsymbol u}}\in {{\mathcal U}}$$.
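The protocol and the regret are easy to express in code; here is a minimal sketch (the helper name `regret` and the argument names are illustrative, not from the post):

```python
def regret(plays, f_ts, u):
    """Regret after T rounds: the played points `plays` (w_1, ..., w_T)
    against the fixed comparator point u, for loss functions `f_ts`."""
    return sum(f(w) - f(u) for f, w in zip(f_ts, plays))
```

The learner's goal is then to keep `regret(plays, f_ts, u)` small simultaneously for every `u` in the domain.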

# Assumption: Quadratic Lower Bound

We will now assume that the functions $$f_t$$ have a particular structure, namely that we can lower bound them by a quadratic. Recall that in round $$t$$ the learner plays $${{\boldsymbol w}}_t \in {{\mathcal U}}$$ and encounters loss function $$f_t : {{\mathcal U}}\to {{\mathbb R}}$$. We assume that the learner is given a vector $${{\boldsymbol g}}_t \in {{\mathbb R}}^d$$ (typically $${{\boldsymbol g}}_t = \nabla f_t({{\boldsymbol w}}_t)$$) and matrix $${{\boldsymbol M}}_t \succeq {{\boldsymbol 0}}$$ such that for all $${{\boldsymbol u}}\in {{\mathcal U}}$$ $$\label{eq:assn} f_t({{\boldsymbol u}}) ~\ge~ {\ell}_t({{\boldsymbol u}}) ,$$ where $${\ell}_t$$ is the convex quadratic surrogate loss defined by ${\ell}_t({{\boldsymbol u}}) ~{:=}~ f_t({{\boldsymbol w}}_t) + {\left\langle{{\boldsymbol u}}-{{\boldsymbol w}}_t, {{\boldsymbol g}}_t\right\rangle} + \frac{1}{2} ({{\boldsymbol u}}-{{\boldsymbol w}}_t)^{\intercal}{{\boldsymbol M}}_t ({{\boldsymbol u}}-{{\boldsymbol w}}_t) .$ Note that by construction $$f_t({{\boldsymbol w}}_t) = {\ell}_t({{\boldsymbol w}}_t)$$ is tight. Hence the regret $$R_T^{{\boldsymbol u}}$$ can be bounded by the surrogate regret $$\tilde R_T^{{\boldsymbol u}}$$, i.e.  $$\label{eq:surr.vs.real.regret} R_T^{{\boldsymbol u}}~=~ \sum_{t=1}^T {\left(f_t({{\boldsymbol w}}_t) - f_t({{\boldsymbol u}})\right)} ~\le~ \sum_{t=1}^T {\left({\ell}_t({{\boldsymbol w}}_t) - {\ell}_t({{\boldsymbol u}})\right)} ~=~ \tilde R_T^{{\boldsymbol u}}.$$ The following figure illustrates the loss function and the surrogate regret at various points $${{\boldsymbol w}}$$ in $$d=1$$ dimension.
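As a quick sanity check of the construction, here is a sketch of the surrogate loss (the function name and arguments are illustrative). For the example $$f({{\boldsymbol u}}) = \|{{\boldsymbol u}}\|^2$$, whose Hessian is $$2{{\boldsymbol I}}$$, taking $${{\boldsymbol M}}_t = 2{{\boldsymbol I}}$$ makes the surrogate coincide with $$f$$ exactly, and in general the surrogate is tight at $${{\boldsymbol w}}_t$$:

```python
import numpy as np

def surrogate(u, w_t, f_wt, g_t, M_t):
    """Quadratic surrogate ell_t(u) built from the play w_t, the value
    f_wt = f_t(w_t), the (sub)gradient g_t and the matrix M_t >= 0."""
    d = u - w_t
    return f_wt + d @ g_t + 0.5 * d @ M_t @ d
```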

Now how to control that surrogate regret $$\tilde R_T^{{\boldsymbol u}}$$? Let’s play standard Exponential Weights! Hold on, why not play Exponential Weights on the original losses $$f_t$$ directly? Well, the cool thing about Exponential Weights on the quadratic surrogate losses $${\ell}_t$$ (and with Gaussian prior) is that all quantities of interest can be computed exactly in closed form. This is a major computational advantage, and it also provides more explicit insight into what the algorithm “thinks”.

# Exponential Weights

We now analyze the behavior of the standard Exponential Weights (technically the exponentially weighted average forecaster) applied to the surrogate loss $${\ell}_t$$. The Exponential Weights strategy will produce a sequence of distributions $$P_1, P_2, \ldots$$ on $${{\mathbb R}}^d$$. It will be convenient to maintain weights on all of $${{\mathbb R}}^d$$ instead of only on $${{\mathcal U}}$$, but we will ensure that each $$P_t$$ has its mean in $${{\mathcal U}}$$. We will start out with the Gaussian prior $P_1 ~{:=}~ {{\mathcal N}}({{\boldsymbol 0}}, {{\boldsymbol \Sigma}}) ,$ with zero mean (here we assume that $${{\boldsymbol 0}}\in {{\mathcal U}}$$) and covariance $${{\boldsymbol \Sigma}}\succ {{\boldsymbol 0}}$$. Then in each round $$t$$ we play the mean of $$P_t$$ ${{\boldsymbol w}}_t ~{:=}~ \operatorname*{\mathbb E}_{P_t({{\boldsymbol u}})} {\left[{{\boldsymbol u}}\right]} .$ Playing the mean is a better idea than randomisation since $${\ell}_t({{\boldsymbol w}}_t) \le \operatorname*{\mathbb E}_{P_t({{\boldsymbol u}})}{\left[{\ell}_t({{\boldsymbol u}})\right]}$$ by Jensen’s inequality. Exponential Weights subsequently updates the distribution to $P_{t+1} ~{:=}~ \operatorname*{argmin}_{P \in \mathcal P}~ \operatorname*{KL}(P\|P_t) + \eta \operatorname*{\mathbb E}_{P({{\boldsymbol u}})} {\left[{\ell}_t({{\boldsymbol u}})\right]} ,$ where $\mathcal P ~{:=}~ {\left\{P\middle|\operatorname*{\mathbb E}_{P({{\boldsymbol u}})}[{{\boldsymbol u}}] \in {{\mathcal U}}\right\}}$ is the set of distributions on $${{\mathbb R}}^d$$ with mean in $${{\mathcal U}}$$ and $$\eta > 0$$ is the learning rate (a parameter we will tune below). 
It is customary and convenient to equivalently decompose the update into the following two steps $P_{t+1} ~=~ \operatorname*{argmin}_{P \in \mathcal P}~ \operatorname*{KL}(P\|\tilde P_{t+1}) \qquad \text{where} \qquad \tilde P_{t+1} ~{:=}~ \operatorname*{argmin}_{P}~ \operatorname*{KL}(P\|P_t) + \eta \operatorname*{\mathbb E}_{P({{\boldsymbol u}})} {\left[{\ell}_t({{\boldsymbol u}})\right]} .$ We then find that both $$P_{t+1}$$ and $$\tilde P_{t+1}$$ are Gaussian, in particular $P_{t+1} ~=~ {{\mathcal N}}({{\boldsymbol w}}_{t+1}, {{\boldsymbol \Sigma}}_{t+1}) \qquad \text{and} \qquad \tilde P_{t+1} ~=~ {{\mathcal N}}(\tilde {{\boldsymbol w}}_{t+1}, {{\boldsymbol \Sigma}}_{t+1}) ,$ where \begin{align*} {{\boldsymbol \Sigma}}_{t+1}^{-1} &~=~ {{\boldsymbol \Sigma}}_t^{-1} + \eta {{\boldsymbol M}}_t \\ \tilde {{\boldsymbol w}}_{t+1} &~=~ {{\boldsymbol w}}_t - \eta {{\boldsymbol \Sigma}}_{t+1} {{\boldsymbol g}}_t \\ {{\boldsymbol w}}_{t+1} &~=~ \operatorname*{argmin}_{{{\boldsymbol w}}\in {{\mathcal U}}}~ ({{\boldsymbol w}}- \tilde {{\boldsymbol w}}_{t+1})^{\intercal}{{\boldsymbol \Sigma}}_{t+1}^{-1} ({{\boldsymbol w}}-\tilde {{\boldsymbol w}}_{t+1}) \end{align*} So far so good. Even though the Exponential Weights algorithm maintains distributions $$P_t$$ on $${{\mathbb R}}^d$$ (which could have arbitrarily high complexity), it “collapses”, meaning that it can be implemented by maintaining just $${{\boldsymbol w}}_t$$ and $${{\boldsymbol \Sigma}}_t$$, which require at most $$d$$ and $$d^2$$ parameters respectively. This is the big advantage of working with a quadratic loss and a Gaussian prior.
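The collapsed update fits in a few lines of code. The sketch below assumes the unconstrained case $${{\mathcal U}}= {{\mathbb R}}^d$$, so the final projection step is the identity; the function name is illustrative:

```python
import numpy as np

def ew_step(w, Sigma, g, M, eta):
    """One round of collapsed Exponential Weights on the quadratic
    surrogate, for the unconstrained case U = R^d (no projection)."""
    # covariance update: Sigma_{t+1}^{-1} = Sigma_t^{-1} + eta * M_t
    Sigma_next = np.linalg.inv(np.linalg.inv(Sigma) + eta * M)
    # mean update: tilde-w_{t+1} = w_t - eta * Sigma_{t+1} g_t
    w_next = w - eta * Sigma_next @ g
    return w_next, Sigma_next
```

With $${{\boldsymbol M}}= {{\boldsymbol 0}}$$ the covariance never changes and the mean update is a plain gradient step preconditioned by $$\eta {{\boldsymbol \Sigma}}$$.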

# Regret Analysis

We now perform the standard analysis of the Exponential Weights algorithm. We start by observing that $$\operatorname*{KL}$$ is non-negative and satisfies the Pythagorean Inequality. Hence for any $$Q \in \mathcal P$$ \begin{align*} \operatorname*{KL}(Q\|P_1) &~\ge~ \operatorname*{KL}(Q\|P_1) - \operatorname*{KL}(Q\|P_{T+1}) \\ &~=~ \sum_{t=1}^T {\big( \operatorname*{KL}(Q\|P_t) - \operatorname*{KL}(Q\|P_{t+1})\big)} \\ &~\ge~ \sum_{t=1}^T {\big( \operatorname*{KL}(Q\|P_t) - \operatorname*{KL}(Q\|\tilde P_{t+1})\big)} \\ &~=~ \sum_{t=1}^T {\left( \eta \operatorname*{\mathbb E}_{Q({{\boldsymbol u}})} {\left[{\ell}_t({{\boldsymbol w}}_t) - {\ell}_t({{\boldsymbol u}})\right]} - \ln \operatorname*{\mathbb E}_{P_t({{\boldsymbol u}})} e^{\eta ({\ell}_t({{\boldsymbol w}}_t)- {\ell}_t({{\boldsymbol u}}))} \right)} . \end{align*} Interestingly, these are the only two bounds necessary to analyse the surrogate regret. In the remainder we will merely simplify with equality. Reorganising the above yields $$\label{eq:surr.regret} \operatorname*{\mathbb E}_{Q({{\boldsymbol u}})} {\left[\tilde R_T^{{\boldsymbol u}}\right]} ~\le~ \frac{ \operatorname*{KL}(Q\|P_1) + \sum_{t=1}^T \ln \operatorname*{\mathbb E}_{P_t({{\boldsymbol u}})} e^{\eta ({\ell}_t({{\boldsymbol w}}_t)- {\ell}_t({{\boldsymbol u}}))} }{ \eta } .$$ It will be sufficient to consider Gaussian comparators $Q ~=~ {{\mathcal N}}{({{\boldsymbol \nu}}, {{\boldsymbol \Delta}})}$ with mean $${{\boldsymbol \nu}}\in {{\mathcal U}}$$ and covariance $${{\boldsymbol \Delta}}\succeq {{\boldsymbol 0}}$$. We now separately expand the three major sub-expressions of $$\eqref{eq:surr.regret}$$. 
First, \begin{align*} &\qquad \operatorname*{\mathbb E}_{Q({{\boldsymbol u}})} {\left[{\ell}_t({{\boldsymbol w}}_t) - {\ell}_t({{\boldsymbol u}})\right]} \\ &~=~ \operatorname*{\mathbb E}_{Q({{\boldsymbol u}})} {\left[ - {\left\langle{{\boldsymbol u}}-{{\boldsymbol w}}_t, {{\boldsymbol g}}_t\right\rangle} - \frac{1}{2} ({{\boldsymbol u}}-{{\boldsymbol w}}_t)^{\intercal}{{\boldsymbol M}}_t ({{\boldsymbol u}}-{{\boldsymbol w}}_t) \right]} \\ &~=~ - {\left\langle{{\boldsymbol \nu}}-{{\boldsymbol w}}_t, {{\boldsymbol g}}_t\right\rangle} - \frac{1}{2} ({{\boldsymbol \nu}}-{{\boldsymbol w}}_t)^{\intercal}{{\boldsymbol M}}_t ({{\boldsymbol \nu}}-{{\boldsymbol w}}_t) - \frac{1}{2} \operatorname*{\mathbb E}_{Q({{\boldsymbol u}})} {\left[ ({{\boldsymbol u}}-{{\boldsymbol \nu}})^{\intercal}{{\boldsymbol M}}_t ({{\boldsymbol u}}-{{\boldsymbol \nu}}) \right]} \\ &~=~ {\ell}_t({{\boldsymbol w}}_t) - {\ell}_t({{\boldsymbol \nu}}) - \frac{1}{2} \operatorname*{tr}{\left({{\boldsymbol \Delta}}{{\boldsymbol M}}_t\right)} \end{align*} from which it follows that $$\label{eq:expcomp} \operatorname*{\mathbb E}_{Q({{\boldsymbol u}})} {\left[\tilde R_T^{{\boldsymbol u}}\right]} ~=~ \tilde R_T^{{\boldsymbol \nu}}- \frac{1}{2} \operatorname*{tr}{\left({{\boldsymbol \Delta}}\sum_{t=1}^T {{\boldsymbol M}}_t\right)} .$$ Second, since $$P_t = {{\mathcal N}}({{\boldsymbol w}}_t, {{\boldsymbol \Sigma}}_t)$$, we have \begin{align*} &\qquad \ln \operatorname*{\mathbb E}_{P_t({{\boldsymbol u}})} e^{\eta ({\ell}_t({{\boldsymbol w}}_t)- {\ell}_t({{\boldsymbol u}}))} \\ &~=~ \ln \operatorname*{\mathbb E}_{P_t({{\boldsymbol u}})} e^{\eta {\left( {\left\langle{{\boldsymbol w}}_t-{{\boldsymbol u}}, {{\boldsymbol g}}_t\right\rangle} - \frac{1}{2} ({{\boldsymbol u}}-{{\boldsymbol w}}_t)^{\intercal}{{\boldsymbol M}}_t ({{\boldsymbol u}}-{{\boldsymbol w}}_t) \right)}} \\ &~=~ \frac{\eta^2}{2} {{\boldsymbol g}}_t^{\intercal}{{\boldsymbol \Sigma}}_{t+1} {{\boldsymbol g}}_t - \frac{1}{2} \ln \frac{ |{{\boldsymbol \Sigma}}_t| }{ |{{\boldsymbol \Sigma}}_{t+1}| } , \end{align*} so that $$\label{eq:cummixloss} \sum_{t=1}^T \ln \operatorname*{\mathbb E}_{P_t({{\boldsymbol u}})} e^{\eta ({\ell}_t({{\boldsymbol w}}_t)- {\ell}_t({{\boldsymbol u}}))} ~=~ \frac{\eta^2}{2} \sum_{t=1}^T {{\boldsymbol g}}_t^{\intercal}{{\boldsymbol \Sigma}}_{t+1} {{\boldsymbol g}}_t - \frac{1}{2} \ln \frac{ |{{\boldsymbol \Sigma}}| }{ |{{\boldsymbol \Sigma}}_{T+1}| } .$$ Finally, since $$Q = {{\mathcal N}}({{\boldsymbol \nu}}, {{\boldsymbol \Delta}})$$ and $$P_1 = {{\mathcal N}}({{\boldsymbol 0}}, {{\boldsymbol \Sigma}})$$ are Gaussian, $$\label{eq:kl} \operatorname*{KL}{(Q\|P_1)} ~=~ \frac{1}{2} {\left( \ln \frac{|{{\boldsymbol \Sigma}}|}{|{{\boldsymbol \Delta}}|} + \operatorname*{tr}{\left({{\boldsymbol \Delta}}{{\boldsymbol \Sigma}}^{-1}\right)} - d + {{\boldsymbol \nu}}^{\intercal}{{\boldsymbol \Sigma}}^{-1} {{\boldsymbol \nu}}\right)} .$$ Substituting the three equalities $$\eqref{eq:expcomp}$$, $$\eqref{eq:cummixloss}$$ and $$\eqref{eq:kl}$$ into our bound $$\eqref{eq:surr.regret}$$ results in $\tilde R_T^{{\boldsymbol \nu}}- \frac{1}{2} \operatorname*{tr}{\left({{\boldsymbol \Delta}}\sum_{t=1}^T {{\boldsymbol M}}_t\right)} ~\le~ \frac{ \ln \frac{|{{\boldsymbol \Sigma}}|}{|{{\boldsymbol \Delta}}|} + \operatorname*{tr}{\left({{\boldsymbol \Delta}}{{\boldsymbol \Sigma}}^{-1}\right)} - d + {{\boldsymbol \nu}}^{\intercal}{{\boldsymbol \Sigma}}^{-1} {{\boldsymbol \nu}}+ \eta^2 \sum_{t=1}^T {{\boldsymbol g}}_t^{\intercal}{{\boldsymbol \Sigma}}_{t+1} {{\boldsymbol g}}_t - \ln \frac{ |{{\boldsymbol \Sigma}}| }{ |{{\boldsymbol \Sigma}}_{T+1}| } }{ 2 \eta } ,$ that is (using $${{\boldsymbol \Sigma}}_{T+1}^{-1} = {{\boldsymbol \Sigma}}^{-1} + \eta \sum_{t=1}^T {{\boldsymbol M}}_t$$) $\tilde R_T^{{\boldsymbol \nu}}~\le~ \frac{ \ln \frac{|{{\boldsymbol \Sigma}}_{T+1}|}{|{{\boldsymbol \Delta}}|} + \operatorname*{tr}{\left({{\boldsymbol \Delta}}{{\boldsymbol \Sigma}}_{T+1}^{-1}\right)} - d + {{\boldsymbol \nu}}^{\intercal}{{\boldsymbol \Sigma}}^{-1} {{\boldsymbol \nu}}+ \eta^2 \sum_{t=1}^T {{\boldsymbol g}}_t^{\intercal}{{\boldsymbol \Sigma}}_{t+1} {{\boldsymbol g}}_t }{ 2 \eta } .$ As this holds for all $${{\boldsymbol \Delta}}$$, we may use the optimal value, which we find by setting the derivative to zero, giving ${{\boldsymbol \Delta}}~=~ {{\boldsymbol \Sigma}}_{T+1} ,$ and plugging that in simplifies the expression considerably, yielding $\tilde R_T^{{\boldsymbol \nu}}~\le~ \frac{1}{2 \eta} {{\boldsymbol \nu}}^{\intercal}{{\boldsymbol \Sigma}}^{-1} {{\boldsymbol \nu}}+ \frac{\eta}{2} \sum_{t=1}^T {{\boldsymbol g}}_t^{\intercal}{{\boldsymbol \Sigma}}_{t+1} {{\boldsymbol g}}_t .$ Finally, since $$R_T^{{\boldsymbol u}}\le \tilde R_T^{{\boldsymbol u}}$$ for all $${{\boldsymbol u}}\in {{\mathcal U}}$$ by $$\eqref{eq:surr.vs.real.regret}$$, we obtain the regret bound $$\label{eq:mainbound} R_T^{{\boldsymbol \nu}}~\le~ \frac{1}{2 \eta} {{\boldsymbol \nu}}^{\intercal}{{\boldsymbol \Sigma}}^{-1} {{\boldsymbol \nu}}+ \frac{\eta}{2} \sum_{t=1}^T {{\boldsymbol g}}_t^{\intercal}{{\boldsymbol \Sigma}}_{t+1} {{\boldsymbol g}}_t .$$ The more variance $${{\boldsymbol \Sigma}}_{t+1}$$ captures (i.e. the smaller it is), the better the second term. But to make $${{\boldsymbol \Sigma}}_{t+1}$$ small we only have one knob, namely starting with a small $${{\boldsymbol \Sigma}}$$. We then see that there is a trade-off, as a small $${{\boldsymbol \Sigma}}$$ will blow up the first term. The next step toward making the bound explicit depends on the details of the scenario. Next we consider the three cases mentioned in the introduction.

# Applications

We now recover the three algorithms and their analysis from the introduction. Throughout we take spherical prior covariance $${{\boldsymbol \Sigma}}= \sigma^2{{\boldsymbol I}}$$ and use $${{\boldsymbol g}}_t = \nabla f_t({{\boldsymbol w}}_t)$$. So the three cases only differ in their choice of $${{\boldsymbol M}}_t$$. Let us also make the standard boundedness assumptions $${\|{{\boldsymbol g}}_t\|} \le G$$ and $${\|{{\boldsymbol \nu}}\|} \le D$$ for each $${{\boldsymbol \nu}}\in {{\mathcal U}}$$.

## Convex

For convex loss functions $$f_t$$ we may satisfy $$\eqref{eq:assn}$$ by taking $${{\boldsymbol M}}_t = {{\boldsymbol 0}}$$. We then find $${{\boldsymbol \Sigma}}_{t+1} = {{\boldsymbol \Sigma}}$$, indicating that the algorithm, which is Gradient Descent, does not learn the covariance. Then by $$\eqref{eq:mainbound}$$ \begin{align*} R_T^{{\boldsymbol \nu}}&~\le~ \frac{1}{2 \eta \sigma^2} {\|{{\boldsymbol \nu}}\|}^2 + \frac{\eta \sigma^2}{2} \sum_{t=1}^T {\|{{\boldsymbol g}}_t\|}^2 \\ &~\le~ \frac{1}{2 \eta \sigma^2} D^2 + \frac{\eta \sigma^2}{2} G^2 T . \end{align*} This bound is optimized for $\eta ~=~ \frac{D}{\sigma^2 G \sqrt{T}} ,$ where it reads $R_T^{{\boldsymbol \nu}}~\le~ G D \sqrt{T} .$
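A small numerical experiment, on the illustrative instance of 1-d linear losses $$f_t(w) = g_t w$$ over $${{\mathcal U}}= [-D, D]$$ with the tuned step size, confirms that projected Gradient Descent stays within the $$GD\sqrt{T}$$ bound:

```python
import numpy as np

# Illustrative instance: 1-d linear losses f_t(w) = g_t * w on U = [-D, D].
rng = np.random.default_rng(0)
D, G, T = 1.0, 1.0, 100
step = D / (G * np.sqrt(T))          # plays the role of eta * sigma^2
gs = rng.choice([-1.0, 1.0], size=T)
w, plays = 0.0, []
for g in gs:
    plays.append(w)
    w = float(np.clip(w - step * g, -D, D))  # gradient step, then projection onto U
loss_alg = float(np.dot(gs, plays))
loss_best = min(u * gs.sum() for u in (-D, D))  # best fixed u is an endpoint
regret_T = loss_alg - loss_best
```

On this instance the bound guarantees `regret_T` is at most $$G D \sqrt{T} = 10$$.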

## Strongly Convex

For $$\beta$$-strongly convex loss functions $$f_t$$, we may take $${{\boldsymbol M}}_t = \beta {{\boldsymbol I}}$$ to satisfy $$\eqref{eq:assn}$$. We then find $${{\boldsymbol \Sigma}}_{t+1} = \frac{1}{\frac{1}{\sigma^2} + \beta \eta t}{{\boldsymbol I}}$$, a scaled-down identity. The fact that the covariance matrix does not depend on the data but only on the index of the current round indicates that we are running Gradient Descent with a “time-varying learning rate” of order $$1/t$$. Plugging this into $$\eqref{eq:mainbound}$$, we find \begin{align*} R_T^{{\boldsymbol \nu}}&~\le~ \frac{1}{2 \eta \sigma^2} {\|{{\boldsymbol \nu}}\|}^2 + \frac{\eta}{2} \sum_{t=1}^T \frac{1}{\frac{1}{\sigma^2} + \beta \eta t} {\|{{\boldsymbol g}}_t\|}^2 \\ &~\le~ \frac{1}{2 \eta \sigma^2} D^2 + \frac{\eta G^2}{2} \sum_{t=1}^T \frac{1}{\frac{1}{\sigma^2} + \beta \eta t} \\ &~\le~ \frac{1}{2 \eta \sigma^2} D^2 + \frac{\eta G^2}{2} \int_0^T \frac{1}{\frac{1}{\sigma^2} + \beta \eta t} {\,\textrm{d}}t \\ &~=~ \frac{1}{2 \eta \sigma^2} D^2 + \frac{1}{2 \beta} G^2 \ln {\left(1 + \beta \eta \sigma^2 T\right)} . \end{align*} This suggests tuning $\eta ~=~ \operatorname*{argmin}_\eta~ {\left\{ \frac{1}{2 \eta \sigma^2} D^2 + \frac{1}{2 \beta} G^2 \ln {\left(\beta \eta \sigma^2 T\right)} \right\}} ~=~ \frac{\beta D^2}{\sigma^2 G^2} ,$ resulting in $R_T^{{\boldsymbol \nu}}~\le~ \frac{G^2}{2 \beta} {\left( 1 + \ln {\left(1 + \frac{\beta^2 D^2}{G^2} T\right)} \right)} .$
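The closed form for $${{\boldsymbol \Sigma}}_{t+1}$$ is easy to verify numerically: the sketch below just iterates the recursion $${{\boldsymbol \Sigma}}_{t+1}^{-1} = {{\boldsymbol \Sigma}}_t^{-1} + \eta \beta {{\boldsymbol I}}$$ with illustrative parameter values:

```python
import numpy as np

# Illustrative parameters for the strongly convex case (M_t = beta * I).
beta, eta, sigma2, d, T = 0.5, 0.2, 4.0, 3, 10
Sigma_inv = np.eye(d) / sigma2       # prior precision Sigma^{-1}
for t in range(T):
    Sigma_inv = Sigma_inv + eta * beta * np.eye(d)
Sigma = np.linalg.inv(Sigma_inv)     # should equal I / (1/sigma^2 + beta*eta*T)
```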

## Exp-concave

For $$\beta$$-exp-concave loss functions $$f_t$$, we may satisfy assumption $$\eqref{eq:assn}$$ by taking $${{\boldsymbol M}}_t = \beta {{\boldsymbol g}}_t {{\boldsymbol g}}_t^{\intercal}$$. We then find that the algorithm is the Online Newton Step of (Hazan, Agarwal, and Kale 2007). To analyse its regret, we invoke concavity of the log-determinant to obtain $\ln |{{\boldsymbol \Sigma}}_{t+1}^{-1}| - \ln |{{\boldsymbol \Sigma}}_t^{-1}| ~\ge~ \eta \operatorname*{tr}{\left( {{\boldsymbol M}}_t {{\boldsymbol \Sigma}}_{t+1} \right)} ~=~ \beta \eta {{\boldsymbol g}}_t^{\intercal}{{\boldsymbol \Sigma}}_{t+1} {{\boldsymbol g}}_t .$ Applying this to $$\eqref{eq:mainbound}$$ we conclude \begin{align*} R_T^{{\boldsymbol \nu}}&~\le~ \frac{1}{2 \eta \sigma^2} {\|{{\boldsymbol \nu}}\|}^2 + \frac{1}{2 \beta} \ln \det {\left({{\boldsymbol I}}+ \beta \eta \sigma^2 \sum_{t=1}^T {{\boldsymbol g}}_t {{\boldsymbol g}}_t^{\intercal}\right)} \\ &~\le~ \frac{1}{2 \eta \sigma^2} {\|{{\boldsymbol \nu}}\|}^2 + \frac{d}{2 \beta} \ln {\left( 1 + \frac{\beta \eta \sigma^2}{d} \sum_{t=1}^T {\left\|{{\boldsymbol g}}_t\right\|}^2 \right)} \\ &~\le~ \frac{1}{2 \eta \sigma^2} D^2 + \frac{d}{2 \beta} \ln {\left( 1 + \frac{\beta \eta \sigma^2}{d} G^2 T \right)} . \end{align*} This suggests tuning $\eta ~=~ \operatorname*{argmin}_\eta~ {\left\{ \frac{1}{2 \eta \sigma^2} D^2 + \frac{d}{2 \beta} \ln {\left( \frac{\beta \eta \sigma^2}{d} G^2 T \right)} \right\}} ~=~ \frac{ \beta D^2 }{ \sigma^2 d } ,$ which results in $R_T^{{\boldsymbol \nu}}~\le~ \frac{d}{2 \beta} {\left( 1 + \ln {\left( 1 + \frac{ \beta^2 D^2 G^2 }{ d^2 } T \right)} \right)} .$
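Since $${{\boldsymbol \Sigma}}_{t+1}^{-1}$$ differs from $${{\boldsymbol \Sigma}}_t^{-1}$$ by a rank-one matrix, the covariance can be maintained in $$O(d^2)$$ per round via the Sherman-Morrison formula. A sketch of one unprojected round ($${{\mathcal U}}= {{\mathbb R}}^d$$; the function name is illustrative):

```python
import numpy as np

def ons_step(w, Sigma, g, beta, eta):
    """One unprojected Online Newton Step round with M_t = beta * g g^T,
    using Sherman-Morrison to update Sigma in O(d^2) instead of O(d^3)."""
    Sg = Sigma @ g
    # (Sigma^{-1} + eta*beta*g g^T)^{-1} = Sigma - c * (Sigma g)(Sigma g)^T
    Sigma_next = Sigma - np.outer(Sg, Sg) * (eta * beta) / (1.0 + eta * beta * g @ Sg)
    w_next = w - eta * Sigma_next @ g
    return w_next, Sigma_next
```

The rank-one structure of $${{\boldsymbol M}}_t$$ is exactly what makes this cheap update possible.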

# Discussion

We can see a few interesting things:

• We find that Gradient Descent and Online Newton Step are instances of Exponential Weights. I find this intriguing, as in the past I regarded them as from the “Squared Euclidean” world, not from the “Shannon Entropy” world. Yet once we employ Gaussian weights it turns out the distinction vanishes. It would be interesting to see which tricks from the Entropy world can be ported over in this way.

• The algorithm is over-parametrised. We might as well have fixed the product of the learning rate $$\eta$$ and the prior covariance $${{\boldsymbol \Sigma}}$$.

• I was curious where the factor $$d$$ difference in the regret rate between the strongly convex and exp-concave cases comes from. I was thinking the answer had to be in one of the pieces $$\eqref{eq:expcomp}$$, $$\eqref{eq:cummixloss}$$ or $$\eqref{eq:kl}$$. But that is not quite right, as the “dangerous terms” in $$\eqref{eq:kl}$$ (those involving $${{\boldsymbol \Delta}}$$) actually cancel with fragments from both $$\eqref{eq:cummixloss}$$ and $$\eqref{eq:expcomp}$$. With the generic bound $$\eqref{eq:mainbound}$$ in place, I now see it comes from the difference in behaviour of the summands ${{\boldsymbol g}}_t^{\intercal}{{\boldsymbol \Sigma}}_{t+1} {{\boldsymbol g}}_t ~=~ {{\boldsymbol g}}_t^{\intercal}({{\boldsymbol \Sigma}}_t^{-1} + \eta {{\boldsymbol M}}_t)^{-1} {{\boldsymbol g}}_t .$ This is the place where the difference in the form of $${{\boldsymbol M}}_t$$ kicks in. With $${{\boldsymbol M}}_t \propto {{\boldsymbol I}}$$ the norm of $${{\boldsymbol g}}_t$$ matters, which is independent of $$d$$. But with $${{\boldsymbol M}}_t \propto {{\boldsymbol g}}_t {{\boldsymbol g}}_t^{\intercal}$$ the log-determinant naturally arises, and we unavoidably pick up a factor of $$d$$.

Erven, Tim van, and Wouter M. Koolen. 2016. “MetaGrad: Multiple Learning Rates in Online Learning.” ArXiv, June. https://arxiv.org/pdf/1604.08740.

Hazan, Elad, Amit Agarwal, and Satyen Kale. 2007. “Logarithmic Regret Algorithms for Online Convex Optimization.” Machine Learning 69 (2-3). Kluwer Academic Publishers, Boston: 169–92. http://dx.doi.org/10.1007/s10994-007-5016-8.

Hoeven, Dirk van der. 2016. “Is Mirror Descent a Special Case of Exponential Weights?” Master’s thesis, Mathematical Institute, Leiden University.

Shalev-Shwartz, Shai. 2012. “Online Learning and Online Convex Optimization.” Foundations and Trends in Machine Learning 4 (2): 107–194.