Introduction

There are many algorithms in Online Learning. The list includes Gradient Descent, Exponential Weights, Mirror Descent, Follow the Regularized Leader and more (see e.g. (Shalev-Shwartz 2012)). So it is always good to see how these algorithms relate, and to judge how different they really are. In this post we look at three basic algorithms and show that they are in fact obtained from the same master algorithm. That is, we show that

  1. Gradient Descent (tuned for convex loss functions),

  2. Gradient Descent (tuned for strongly convex loss functions), and

  3. Online Newton Step (for exp-concave loss functions)

are all instances of Exponential Weights with certain so-called surrogate losses. We give a uniform derivation of the algorithm and its regret analysis, and recover the known bounds in each case.

I addressed the first case above in this earlier post, and encountered the connection with the third case while working on (Erven and Koolen 2016). A similar connection between Exponential Weights and the Mirror Descent family of algorithms is studied by (Hoeven 2016).

Setting: Online Optimisation

We look at online optimisation over a convex set \({{\mathcal U}}\subseteq {{\mathbb R}}^d\). Learning proceeds in rounds. In round \(t\) the learner plays a point \({{\boldsymbol w}}_t \in {{\mathcal U}}\). Then the adversary reveals a loss function \(f_t : {{\mathcal U}}\to {{\mathbb R}}\), upon which the learner incurs loss \(f_t({{\boldsymbol w}}_t)\). We evaluate the learner by its so-called regret. After \(T\) rounds, the regret compared to a point \({{\boldsymbol u}}\in {{\mathcal U}}\) is defined by \begin{equation}\label{eq:regret} R_T^{{\boldsymbol u}}~{:=}~ \sum_{t=1}^T {\big(f_t({{\boldsymbol w}}_t) - f_t({{\boldsymbol u}})\big)} . \end{equation} The goal of the learner is to ensure small regret compared to all \({{\boldsymbol u}}\in {{\mathcal U}}\).

Assumption: Quadratic Lower Bound

We will now assume that the functions \(f_t\) have a particular structure, namely that we can lower bound them by a quadratic. Recall that in round \(t\) the learner plays \({{\boldsymbol w}}_t \in {{\mathcal U}}\) and encounters loss function \(f_t : {{\mathcal U}}\to {{\mathbb R}}\). We assume that the learner is given a vector \({{\boldsymbol g}}_t \in {{\mathbb R}}^d\) (typically \({{\boldsymbol g}}_t = \nabla f_t({{\boldsymbol w}}_t)\)) and matrix \({{\boldsymbol M}}_t \succeq {{\boldsymbol 0}}\) such that for all \({{\boldsymbol u}}\in {{\mathcal U}}\) \begin{equation}\label{eq:assn} f_t({{\boldsymbol u}}) ~\ge~ {\ell}_t({{\boldsymbol u}}) , \end{equation} where \({\ell}_t\) is the convex quadratic surrogate loss defined by \[{\ell}_t({{\boldsymbol u}}) ~{:=}~ f_t({{\boldsymbol w}}_t) + {\left\langle{{\boldsymbol u}}-{{\boldsymbol w}}_t, {{\boldsymbol g}}_t\right\rangle} + \frac{1}{2} ({{\boldsymbol u}}-{{\boldsymbol w}}_t)^{\intercal}{{\boldsymbol M}}_t ({{\boldsymbol u}}-{{\boldsymbol w}}_t) .\] Note that by construction \(f_t({{\boldsymbol w}}_t) = {\ell}_t({{\boldsymbol w}}_t)\) is tight. Hence the regret \(R_T^{{\boldsymbol u}}\) can be bounded by the surrogate regret \(\tilde R_T^{{\boldsymbol u}}\), i.e.  \begin{equation}\label{eq:surr.vs.real.regret} R_T^{{\boldsymbol u}}~=~ \sum_{t=1}^T {\left(f_t({{\boldsymbol w}}_t) - f_t({{\boldsymbol u}})\right)} ~\le~ \sum_{t=1}^T {\left({\ell}_t({{\boldsymbol w}}_t) - {\ell}_t({{\boldsymbol u}})\right)} ~=~ \tilde R_T^{{\boldsymbol u}}. \end{equation} The following figure illustrates the loss function and the surrogate regret at various points \({{\boldsymbol w}}\) in \(d=1\) dimension.

Example loss function (solid) and quadratic lower bounds (dashed).

Example loss function (solid) and quadratic lower bounds (dashed).

Now how to control that surrogate regret \(\tilde R_T^{{\boldsymbol u}}\)? Let’s play standard Exponential Weights! Hold on, why not play Exponential Weights on the original losses \(f_t\) directly? Well, the cool thing about Exponential Weights on the quadratic surrogate losses \({\ell}_t\) (and with Gaussian prior) is that all quantities of interest can be computed exactly in closed form. This is a major computational advantage, and it also provides more explicit insight into what the algorithm “thinks”.

Exponential Weights

We now analyze the behavior of the standard Exponential Weights (technically the exponentially weighted average forecaster) applied to the surrogate loss \({\ell}_t\). The Exponential Weights strategy will produce a sequence of distributions \(P_1, P_2, \ldots\) on \({{\mathbb R}}^d\). It will be convenient to maintain weights on all of \({{\mathbb R}}^d\) instead of only on \({{\mathcal U}}\), but we will ensure that the each \(P_t\) has its mean in \({{\mathcal U}}\). We will start out with Gaussian prior \[P_1 ~{:=}~ {{\mathcal N}}({{\boldsymbol 0}}, {{\boldsymbol \Sigma}}) ,\] with zero mean (here we assume that \({{\boldsymbol 0}}\in {{\mathcal U}}\)) and covariance \({{\boldsymbol \Sigma}}\succeq {{\boldsymbol 0}}\). Then in each round \(t\) we play the mean of \(P_t\) \[{{\boldsymbol w}}_t ~{:=}~ \operatorname*{\mathbb E}_{P_t({{\boldsymbol u}})} {\left[{{\boldsymbol u}}\right]} .\] Playing the mean is a better idea than randomisation since \({\ell}_t({{\boldsymbol w}}_t) \le \operatorname*{\mathbb E}_{P_t({{\boldsymbol u}})}{\left[{\ell}_t({{\boldsymbol u}})\right]}\) by Jensen’s inequality. Exponential Weights subsequently updates the distribution to \[P_{t+1} ~{:=}~ \min_{P \in \mathcal P}~ \operatorname*{KL}(P\|P_t) + \eta \operatorname*{\mathbb E}_{P({{\boldsymbol u}})} {\left[{\ell}_t({{\boldsymbol u}})\right]} ,\] where \[\mathcal P ~{:=}~ {\left\{P\middle|\operatorname*{\mathbb E}_{P({{\boldsymbol u}})}[{{\boldsymbol u}}] \in {{\mathcal U}}\right\}}\] is the set of distributions on \({{\mathbb R}}^d\) with mean in \({{\mathcal U}}\) and \(\eta > 0\) is the learning rate (a parameter we will tune below). It is customary and convenient to equivalently decompose the update in the following two steps \[P_{t+1} ~=~ \min_{P \in \mathcal P}~ \operatorname*{KL}(P\|\tilde P_{t+1}) \qquad \text{where} \qquad \tilde P_{t+1} ~{:=}~ \min_{P}~ \operatorname*{KL}(P\|P_t) + \eta \operatorname*{\mathbb E}_{P({{\boldsymbol u}})} {\left[{\ell}_t({{\boldsymbol u}})\right]} .\] We then find that both \(P_{t+1}\) and \(\tilde P_{t+1}\) are Gaussian, in particular \[P_{t+1} ~=~ {{\mathcal N}}({{\boldsymbol w}}_{t+1}, {{\boldsymbol \Sigma}}_{t+1}) \qquad \text{and} \qquad \tilde P_{t+1} ~=~ {{\mathcal N}}(\tilde {{\boldsymbol w}}_{t+1}, {{\boldsymbol \Sigma}}_{t+1}) ,\] where \begin{align*} {{\boldsymbol \Sigma}}_{t+1}^{-1} &~=~ {{\boldsymbol \Sigma}}_t^{-1} + \eta {{\boldsymbol M}}_t \\ \tilde {{\boldsymbol w}}_{t+1} &~=~ {{\boldsymbol w}}_t - \eta {{\boldsymbol \Sigma}}_{t+1} {{\boldsymbol g}}_t \\ {{\boldsymbol w}}_{t+1} &~=~ \operatorname*{argmin}_{{{\boldsymbol w}}\in {{\mathcal U}}}~ ({{\boldsymbol w}}- \tilde {{\boldsymbol w}}_{t+1})^{\intercal}{{\boldsymbol \Sigma}}_{t+1}^{-1} ({{\boldsymbol w}}-\tilde {{\boldsymbol w}}_{t+1}) \end{align*} So far so good. Even though the Exponential Weights algorithm maintains distributions \(P_t\) on \({{\mathbb R}}^d\) (which could have arbitrarily high complexity), it “collapses”, meaning that it can be implemented by maintaining just \({{\boldsymbol w}}_t\) and \({{\boldsymbol \Sigma}}_t\), which require at most \(d\) and \(d^2\) parameters respectively. This is the big advantage of working with a quadratic loss and a Guassian prior.

Regret Analysis

We now perform the standard analysis of the Exponential Weights algorithm. We start by observing that \(\operatorname*{KL}\) is non-negative and satisfies the Pythagorean Inequality. Hence for any \(Q \in \mathcal P\) \begin{align*} \operatorname*{KL}(Q\|P_1) &~\ge~ \operatorname*{KL}(Q\|P_1) - \operatorname*{KL}(Q\|P_{T+1}) \\ &~=~ \sum_{t=1}^T {\big( \operatorname*{KL}(Q\|P_t) - \operatorname*{KL}(Q\|P_{t+1})\big)} \\ &~\ge~ \sum_{t=1}^T {\big( \operatorname*{KL}(Q\|P_t) - \operatorname*{KL}(Q\|\tilde P_{t+1})\big)} \\ &~=~ \sum_{t=1}^T {\left( \eta \operatorname*{\mathbb E}_{Q({{\boldsymbol u}})} {\left[{\ell}_t({{\boldsymbol w}}_t) - {\ell}_t({{\boldsymbol u}})\right]} - \ln \operatorname*{\mathbb E}_{P_t({{\boldsymbol u}})} e^{\eta ({\ell}_t({{\boldsymbol w}}_t)- {\ell}_t({{\boldsymbol u}}))} \right)} . \end{align*} Interestingly, these are the only two bounds necessary to analyse the surrogate regret. In the remainder we will merely simplify with equality. Reorganising the above yields \begin{equation}\label{eq:surr.regret} \operatorname*{\mathbb E}_{Q({{\boldsymbol u}})} {\left[\tilde R_T^{{\boldsymbol u}}\right]} ~\le~ \frac{ \operatorname*{KL}(Q\|P_1) + \sum_{t=1}^T \ln \operatorname*{\mathbb E}_{P_t({{\boldsymbol u}})} e^{\eta ({\ell}_t({{\boldsymbol w}}_t)- {\ell}_t({{\boldsymbol u}}))} }{ \eta } . \end{equation} It will be sufficient to consider Gaussian comparators \[Q ~=~ {{\mathcal N}}{({{\boldsymbol \nu}}, {{\boldsymbol \Delta}})}\] with mean \({{\boldsymbol \nu}}\in {{\mathcal U}}\) and covariance \({{\boldsymbol \Delta}}\succeq {{\boldsymbol 0}}\). We now separately expand the three major sub-expressions of \(\eqref{eq:surr.regret}\). First, \begin{align*} &\qquad \operatorname*{\mathbb E}_{Q({{\boldsymbol u}})} {\left[{\ell}_t({{\boldsymbol w}}_t) - {\ell}_t({{\boldsymbol u}})\right]} \\ &~=~ \operatorname*{\mathbb E}_{Q({{\boldsymbol u}})} {\left[ - {\left\langle{{\boldsymbol u}}-{{\boldsymbol w}}_t, {{\boldsymbol g}}_t\right\rangle} - \frac{1}{2} ({{\boldsymbol u}}-{{\boldsymbol w}}_t)^{\intercal}{{\boldsymbol M}}_t ({{\boldsymbol u}}-{{\boldsymbol w}}_t) \right]} \\ &~=~ - {\left\langle{{\boldsymbol \nu}}-{{\boldsymbol w}}_t, {{\boldsymbol g}}_t\right\rangle} - \frac{1}{2} ({{\boldsymbol \nu}}-{{\boldsymbol w}}_t)^{\intercal}{{\boldsymbol M}}_t ({{\boldsymbol \nu}}-{{\boldsymbol w}}_t) - \frac{1}{2} \operatorname*{\mathbb E}_{Q({{\boldsymbol u}})} {\left[ ({{\boldsymbol u}}-{{\boldsymbol \nu}})^{\intercal}{{\boldsymbol M}}_t ({{\boldsymbol u}}-{{\boldsymbol \nu}}) \right]} \\ &~=~ {\ell}_t({{\boldsymbol w}}_t) - {\ell}_t({{\boldsymbol \nu}}) - \frac{1}{2} \operatorname*{tr}{\left({{\boldsymbol \Delta}}{{\boldsymbol M}}_t\right)} \end{align*} from which it follows that \begin{equation}\label{eq:expcomp} \operatorname*{\mathbb E}_{Q({{\boldsymbol u}})} {\left[\tilde R_T^{{\boldsymbol u}}\right]} ~=~ \tilde R_T^{{\boldsymbol \nu}}- \frac{1}{2} \operatorname*{tr}{\left({{\boldsymbol \Delta}}\sum_{t=1}^T {{\boldsymbol M}}_t\right)} . \end{equation} Second, since \(P_t = {{\mathcal N}}({{\boldsymbol w}}_t, {{\boldsymbol \Sigma}}_t)\), we have \begin{align*} &\qquad \ln \operatorname*{\mathbb E}_{P_t({{\boldsymbol u}})} e^{\eta ({\ell}_t({{\boldsymbol w}}_t)- {\ell}_t({{\boldsymbol u}}))} \\ &~=~ \ln \operatorname*{\mathbb E}_{P_t({{\boldsymbol u}})} e^{\eta {\left( {\left\langle{{\boldsymbol w}}_t-{{\boldsymbol u}}, {{\boldsymbol g}}_t\right\rangle} - \frac{1}{2} ({{\boldsymbol u}}-{{\boldsymbol w}}_t)^{\intercal}{{\boldsymbol M}}_t ({{\boldsymbol u}}-{{\boldsymbol w}}_t) \right)} )} \\ &~=~ \frac{\eta^2}{2} {{\boldsymbol g}}_t {{\boldsymbol \Sigma}}_{t+1} {{\boldsymbol g}}_t - \frac{1}{2} \ln \frac{ |{{\boldsymbol \Sigma}}_t| }{ |{{\boldsymbol \Sigma}}_{t+1}| } , \end{align*} so that \begin{equation}\label{eq:cummixloss} \sum_{t=1}^T \ln \operatorname*{\mathbb E}_{P_t({{\boldsymbol u}})} e^{\eta ({\ell}_t({{\boldsymbol w}}_t)- {\ell}_t({{\boldsymbol u}}))} ~=~ \frac{\eta^2}{2} \sum_{t=1}^T {{\boldsymbol g}}_t {{\boldsymbol \Sigma}}_{t+1} {{\boldsymbol g}}_t - \frac{1}{2} \ln \frac{ |{{\boldsymbol \Sigma}}| }{ |{{\boldsymbol \Sigma}}_{T+1}| } . \end{equation} Finally, since \(Q = {{\mathcal N}}({{\boldsymbol \nu}}, {{\boldsymbol \Delta}})\) and \(P_1 = {{\mathcal N}}({{\boldsymbol 0}}, {{\boldsymbol \Sigma}})\) are Gaussian, \begin{equation}\label{eq:kl} \operatorname*{KL}{(Q\|P_1)} ~=~ \frac{1}{2} {\left( \ln \frac{|{{\boldsymbol \Sigma}}|}{|{{\boldsymbol \Delta}}|} + \operatorname*{tr}{\left({{\boldsymbol \Delta}}{{\boldsymbol \Sigma}}^{-1}\right)} - d + {{\boldsymbol \nu}}^{\intercal}{{\boldsymbol \Sigma}}^{-1} {{\boldsymbol \nu}}\right)} . \end{equation} Substituting the three equalities \(\eqref{eq:expcomp}\), \(\eqref{eq:cummixloss}\) and \(\eqref{eq:kl}\) into our bound \(\eqref{eq:surr.regret}\) results in \[\tilde R_T^{{\boldsymbol \nu}}- \frac{1}{2} \operatorname*{tr}{\left({{\boldsymbol \Delta}}\sum_{t=1}^T {{\boldsymbol M}}_t\right)} ~\le~ \frac{ \ln \frac{|{{\boldsymbol \Sigma}}|}{|{{\boldsymbol \Delta}}|} + \operatorname*{tr}{\left({{\boldsymbol \Delta}}{{\boldsymbol \Sigma}}^{-1}\right)} - d + {{\boldsymbol \nu}}^{\intercal}{{\boldsymbol \Sigma}}^{-1} {{\boldsymbol \nu}}+ \eta^2 \sum_{t=1}^T {{\boldsymbol g}}_t {{\boldsymbol \Sigma}}_{t+1} {{\boldsymbol g}}_t - \ln \frac{ |{{\boldsymbol \Sigma}}| }{ |{{\boldsymbol \Sigma}}_{T+1}| } }{ 2 \eta } ,\] that is (using \({{\boldsymbol \Sigma}}_{T+1}^{-1} = {{\boldsymbol \Sigma}}^{-1} + \eta \sum_{t=1}^T {{\boldsymbol M}}_t\)) \[\tilde R_T^{{\boldsymbol \nu}}~\le~ \frac{ \ln \frac{|{{\boldsymbol \Sigma}}_{T+1}|}{|{{\boldsymbol \Delta}}|} + \operatorname*{tr}{\left({{\boldsymbol \Delta}}{{\boldsymbol \Sigma}}_{T+1}^{-1}\right)} - d + {{\boldsymbol \nu}}^{\intercal}{{\boldsymbol \Sigma}}^{-1} {{\boldsymbol \nu}}+ \eta^2 \sum_{t=1}^T {{\boldsymbol g}}_t {{\boldsymbol \Sigma}}_{t+1} {{\boldsymbol g}}_t }{ 2 \eta } .\] As this holds for all \({{\boldsymbol \Delta}}\), we may use the optimal value, which we find by setting the derivative to zero, giving \[{{\boldsymbol \Delta}}~=~ {{\boldsymbol \Sigma}}_{T+1} ,\] and plugging that in simplifies the expression considerably, yielding \[\tilde R_T^{{\boldsymbol \nu}}~\le~ \frac{1}{2 \eta} {{\boldsymbol \nu}}^{\intercal}{{\boldsymbol \Sigma}}^{-1} {{\boldsymbol \nu}}+ \frac{\eta}{2} \sum_{t=1}^T {{\boldsymbol g}}_t {{\boldsymbol \Sigma}}_{t+1} {{\boldsymbol g}}_t .\] Finally, since \(R_T^{{\boldsymbol u}}\le \tilde R_T^{{\boldsymbol u}}\) for all \({{\boldsymbol u}}\in {{\mathcal U}}\) by \(\eqref{eq:surr.vs.real.regret}\), we obtain regret bound \begin{equation}\label{eq:mainbound} R_T^{{\boldsymbol \nu}}~\le~ \frac{1}{2 \eta} {{\boldsymbol \nu}}^{\intercal}{{\boldsymbol \Sigma}}^{-1} {{\boldsymbol \nu}}+ \frac{\eta}{2} \sum_{t=1}^T {{\boldsymbol g}}_t {{\boldsymbol \Sigma}}_{t+1} {{\boldsymbol g}}_t . \end{equation} The more variance \({{\boldsymbol \Sigma}}_{t+1}\) captures (i.e. the smaller it is), the better the second term. But to make \({{\boldsymbol \Sigma}}_{t+1}\) small we only have one knob, namely starting with a small \({{\boldsymbol \Sigma}}\). We then see that there is a trade-off, as small \({{\boldsymbol \Sigma}}\) will blow up the first term. The next step toward making the bound explicit depends on the details of the scenario. Next we consider the three cases mentioned in the introduction.

Applications

We now recover the three algorithms and their analysis from the introduction. Throughout we take spherical prior covariance \({{\boldsymbol \Sigma}}= \sigma^2{{\boldsymbol I}}\) and use \({{\boldsymbol g}}_t = \nabla f_t({{\boldsymbol w}}_t)\). So the three cases only differ in their choice of \({{\boldsymbol M}}_t\). Let us also make the standard boundedness assumptions \({\|{{\boldsymbol g}}_t\|} \le G\) and \({\|{{\boldsymbol \nu}}\|} \le D\) for each \({{\boldsymbol \nu}}\in {{\mathcal U}}\).

Convex

For convex loss functions \(f_t\) we may satisfy \(\eqref{eq:assn}\) by taking \({{\boldsymbol M}}_t = {{\boldsymbol 0}}\). We then find \({{\boldsymbol \Sigma}}_{t+1} = {{\boldsymbol \Sigma}}\), indicating that the algorithm, which is Gradient Descent, does not learn the covariance. Then by \(\eqref{eq:mainbound}\) \begin{align*} R_T^{{\boldsymbol \nu}}&~\le~ \frac{1}{2 \eta \sigma^2} {\|{{\boldsymbol \nu}}\|}^2 + \frac{\eta \sigma^2}{2} \sum_{t=1}^T {\|{{\boldsymbol g}}_t\|}^2 \\ &~\le~ \frac{1}{2 \eta \sigma^2} D^2 + \frac{\eta \sigma^2}{2} G^2 T . \end{align*} This bound is optimized for \[\eta ~=~ \frac{D}{\sigma^2 G \sqrt{T}} ,\] where it reads \[R_T^{{\boldsymbol \nu}}~\le~ G D \sqrt{T} .\]

Strongly Convex

For \(\beta\)-strongly convex loss functions \(f_t\), we may take \({{\boldsymbol M}}_t = \beta {{\boldsymbol I}}\) to satisfy \(\eqref{eq:assn}\). We then find \({{\boldsymbol \Sigma}}_{t+1} = \frac{1}{\frac{1}{\sigma^2} + \beta \eta t}{{\boldsymbol I}}\), a scaled-down identity. The fact that the covariance matrix does not depend on the data but only on the index of the current round indicates that we are running Gradient Descent with a “time-varying learning rate” of order \(1/t\). Plugging this into \(\eqref{eq:mainbound}\), we find \begin{align*} R_T^{{\boldsymbol \nu}}&~\le~ \frac{1}{2 \eta \sigma^2} {\|{{\boldsymbol \nu}}\|}^2 + \frac{\eta}{2} \sum_{t=1}^T \frac{1}{\frac{1}{\sigma^2} + \beta \eta t} {\|{{\boldsymbol g}}_t\|}^2 \\ &~\le~ \frac{1}{2 \eta \sigma^2} D^2 + \frac{\eta G^2}{2} \sum_{t=1}^T \frac{1}{\frac{1}{\sigma^2} + \beta \eta t} \\ &~\le~ \frac{1}{2 \eta \sigma^2} D^2 + \frac{\eta G^2}{2} \int_0^T \frac{1}{\frac{1}{\sigma^2} + \beta \eta t} {\,\textrm{d}}t \\ &~=~ \frac{1}{2 \eta \sigma^2} D^2 + \frac{1}{2 \beta} G^2 \ln {\left(1 + \beta \eta \sigma^2 T\right)} . \end{align*} This suggests tuning \[\eta ~=~ \operatorname*{argmin}_\eta~ {\left\{ \frac{1}{2 \eta \sigma^2} D^2 + \frac{1}{2 \beta} G^2 \ln {\left(\beta \eta \sigma^2 T\right)} \right\}} ~=~ \frac{\beta D^2}{\sigma^2 G^2} ,\] resulting in \[R_T^{{\boldsymbol \nu}}~\le~ \frac{G^2}{2 \beta} {\left( 1 + \ln {\left(1 + \frac{\beta^2 D^2}{G^2} T\right)} \right)} .\]

Exp-concave

For \(\beta\)-exp-concave loss functions \(f_t\), we may satisfy assumption \(\eqref{eq:assn}\) by taking \({{\boldsymbol M}}_t = \beta {{\boldsymbol g}}_t {{\boldsymbol g}}_t^{\intercal}\). We then find that the algorithm is Online Newton Step by (Hazan, Agarwal, and Kale 2007). To analyse its regret, we invoke concavity of the log-determinant to obtain \[\ln |{{\boldsymbol \Sigma}}_{t+1}^{-1}| - \ln |{{\boldsymbol \Sigma}}_t^{-1}| ~\ge~ \eta \operatorname*{tr}{\left( {{\boldsymbol M}}_t {{\boldsymbol \Sigma}}_{t+1} \right)} ~=~ \beta \eta {{\boldsymbol g}}_t^{\intercal}{{\boldsymbol \Sigma}}_{t+1} {{\boldsymbol g}}_t .\] Applying this to \(\eqref{eq:mainbound}\) we conclude \begin{align*} R_T^{{\boldsymbol \nu}}&~\le~ \frac{1}{2 \eta \sigma^2} {\|{{\boldsymbol \nu}}\|}^2 + \frac{1}{2 \beta} \ln \det {\left({{\boldsymbol I}}+ \beta \eta \sigma^2 \sum_{t=1}^T {{\boldsymbol g}}_t {{\boldsymbol g}}_t^{\intercal}\right)} \\ &~\le~ \frac{1}{2 \eta \sigma^2} {\|{{\boldsymbol \nu}}\|}^2 + \frac{d}{2 \beta} \ln {\left( 1 + \frac{\beta \eta \sigma^2}{d} \sum_{t=1}^T {\left\|{{\boldsymbol g}}_t\right\|}^2 \right)} \\ &~\le~ \frac{1}{2 \eta \sigma^2} D^2 + \frac{d}{2 \beta} \ln {\left( 1 + \frac{\beta \eta \sigma^2}{d} G^2 T \right)} . \end{align*} This suggest tuning \[\eta ~=~ \operatorname*{argmin}_\eta~ {\left\{ \frac{1}{2 \eta \sigma^2} D^2 + \frac{d}{2 \beta} \ln {\left( \frac{\beta \eta \sigma^2}{d} G^2 T \right)} \right\}} ~=~ \frac{ \beta D^2 }{ \sigma^2 d } ,\] which results in \[R_T^{{\boldsymbol \nu}}~\le~ \frac{d}{2 \beta} {\left( 1 + \ln {\left( 1 + \frac{ \beta^2 D^2 G^2 }{ d^2 } T \right)} \right)} .\]

Discussion

We can see a few interesting things

Erven, Tim van, and Wouter M. Koolen. 2016. “MetaGrad: Multiple Learning Rates in Online Learning.” ArXiv, June. https://arxiv.org/pdf/1604.08740.

Hazan, Elad, Amit Agarwal, and Satyen Kale. 2007. “Logarithmic Regret Algorithms for Online Convex Optimization.” Machine Learning 69 (2-3). Kluwer Academic Publishers, Boston: 169–92. http://dx.doi.org/10.1007/s10994-007-5016-8.

Hoeven, Dirk van der. 2016. “Is Mirror Descent a Special Case of Exponential Weights?” Master’s thesis, Mathematical Institute, Leiden University.

Shalev-Shwartz, Shai. 2012. “Online Learning and Online Convex Optimization.” Foundations and Trends in Machine Learning 4 (2): 107–94.