# Setting

To keep things simple and focused, I will work in the setting of online linear optimisation without projections. Each round $$t$$ the learner plays a point $${\boldsymbol{w}}_t \in {\mathbb R}^d$$, the environment plays a loss coefficient vector $${\boldsymbol{g}}_t \in {\mathbb R}^d$$ and the learner incurs loss $${\boldsymbol{w}}_t^{\intercal}{\boldsymbol{g}}_t$$. The goal of the learner is to minimise the cumulative regret $R_T^{{\boldsymbol{u}}} ~=~ \sum_{t=1}^T ({\boldsymbol{w}}_t - {\boldsymbol{u}})^{\intercal}{\boldsymbol{g}}_t$ compared to any point $${\boldsymbol{u}}$$ in some pre-specified convex domain. In this post we will consider a simple yet typical domain, namely the Euclidean ball of radius $$D$$, i.e. $${\|{\boldsymbol{u}}\|} \le D$$.

Before we discuss AdaGrad, let us first review a simpler algorithm for online optimisation, namely Follow the Regularised Leader with squared Euclidean regularisation. After $$t$$ rounds, it plays \begin{equation}\label{eq:ftrl} {\boldsymbol{w}}_{t+1} ~=~ \operatorname*{argmin}_{{\boldsymbol{w}}}~ \sum_{s=1}^t {\boldsymbol{w}}^{\intercal}{\boldsymbol{g}}_s + \frac{{\|{\boldsymbol{w}}\|}^2}{2 \eta} , \end{equation} or, equivalently, it sequentially updates as ${\boldsymbol{w}}_{t+1} ~=~ {\boldsymbol{w}}_t - \eta {\boldsymbol{g}}_t$ which is called Online Gradient Descent (without projections and with a fixed learning rate $$\eta$$, FTRL and OGD are identical). The learning rate parameter $$\eta$$ needs to be tuned carefully to get a good bound. It can be shown that the scheme $$\eqref{eq:ftrl}$$ guarantees $R_T^{{\boldsymbol{u}}} ~\le~ \frac{{\|{\boldsymbol{u}}\|}^2}{2 \eta} + \frac{\eta}{2} \sum_{t=1}^T {\|{\boldsymbol{g}}_t\|}^2 \qquad \text{for all {\boldsymbol{u}}}.$ The right-hand side expression is optimised in the learning rate $$\eta$$ at \begin{equation} \label{eq:GD.tune} \eta ~=~ \frac{ {\|{\boldsymbol{u}}\|} }{ \sqrt{ \sum_{t=1}^T {\|{\boldsymbol{g}}_t\|}^2 } } \end{equation} which results in the bound \begin{equation} \label{eq:GD.bound} R_T^{{\boldsymbol{u}}} ~\le~ {\|{\boldsymbol{u}}\|} \sqrt{ \sum_{t=1}^T {\|{\boldsymbol{g}}_t\|}^2 } . \end{equation} In practice this tuning $$\eqref{eq:GD.tune}$$ is inaccessible as we do not have prior knowledge of $${\|{\boldsymbol{u}}\|}$$ or $$\sum_{t=1}^T {\|{\boldsymbol{g}}_t\|}^2$$. But we may get pretty close with actual algorithms. For example by tuning for a known upper bound $${\|{\boldsymbol{u}}\|} \le D$$, and by using the doubling trick on $$\sum_{t=1}^T {\|{\boldsymbol{g}}_t\|}^2$$. For smoother methods see e.g. (Orabona and Pál 2016).

The idea of AdaGrad (see (Hazan 2016 Section 5.6)) is that the regulariser $$\frac{1}{2} {\|{\boldsymbol{w}}\|}^2$$ in $$\eqref{eq:ftrl}$$ may not be the best choice for the problem at hand. So let’s replace the scaled Euclidean norm $$\eta^{-1} {\|{\boldsymbol{w}}\|}^2 = \eta^{-1} {\boldsymbol{w}}^{\intercal}{\boldsymbol{w}}$$ by the more general norm $${\|{\boldsymbol{w}}\|}_{{\boldsymbol{X}}}^2 = {\boldsymbol{w}}^{\intercal}{\boldsymbol{X}}^{-1} {\boldsymbol{w}}$$, where $${\boldsymbol{X}}\succeq {\boldsymbol{0}}$$ is now a PSD matrix parameter. Online Gradient Descent is recovered by setting $${\boldsymbol{X}}= \eta {\boldsymbol{I}}$$. With this upgrade, the learner plays \begin{equation}\label{eq:BentDescent} {\boldsymbol{w}}_{t+1} ~=~ \arg\min_{{\boldsymbol{w}}}~ \sum_{s=1}^t {\boldsymbol{g}}_s^{\intercal}{\boldsymbol{w}}+ \frac{1}{2} {\boldsymbol{w}}{\boldsymbol{X}}^{-1} {\boldsymbol{w}}~=~ - {\boldsymbol{X}}\sum_{s=1}^t {\boldsymbol{g}}_s \end{equation} or, equivalently, sequentially updates ${\boldsymbol{w}}_{t+1} ~=~ {\boldsymbol{w}}_t - {\boldsymbol{X}}{\boldsymbol{g}}_t .$ To see how we should tune $${\boldsymbol{X}}$$, let us analyse the regret of this scheme.

## Analysis

Let us introduce convenient abbreviations for the cumulative loss vector $${\boldsymbol{L}}= \sum_{t=1}^T {\boldsymbol{g}}_t$$ and outer product matrix $${\boldsymbol{G}}= \sum_{t=1}^T {\boldsymbol{g}}_t {\boldsymbol{g}}_t^{\intercal}$$. For any $${\boldsymbol{X}}$$, the loss of the scheme $$\eqref{eq:BentDescent}$$ is $\sum_{t=1}^T {\boldsymbol{w}}_t^{\intercal}{\boldsymbol{g}}_t ~=~ - \sum_{t=1}^T \sum_{s=1}^{t-1} {\boldsymbol{g}}_s^{\intercal}{\boldsymbol{X}}{\boldsymbol{g}}_t ~=~ \frac{1}{2} \sum_{t=1}^T {\boldsymbol{g}}_t^{\intercal}{\boldsymbol{X}}{\boldsymbol{g}}_t - \frac{1}{2} {\boldsymbol{L}}^{\intercal}{\boldsymbol{X}}{\boldsymbol{L}}~=~ \frac{1}{2} \operatorname{tr}{\left({\boldsymbol{X}}({\boldsymbol{G}}- {\boldsymbol{L}}{\boldsymbol{L}}^{\intercal})\right)}$ and hence the regret compared to any point $${\boldsymbol{u}}$$ is bounded by \begin{align} \notag R_T^{{\boldsymbol{u}}} ~=~ \sum_{t=1}^T ({\boldsymbol{w}}_t - {\boldsymbol{u}})^{\intercal}{\boldsymbol{g}}_t &~=~ \frac{1}{2} \operatorname{tr}({\boldsymbol{X}}{\boldsymbol{G}}) - {\left({\boldsymbol{u}}^{\intercal}+ \frac{1}{2} {\boldsymbol{L}}^{\intercal}{\boldsymbol{X}}\right)} {\boldsymbol{L}}\\ \label{eq:regretbd} &~\le~ \frac{1}{2} \operatorname{tr}({\boldsymbol{X}}{\boldsymbol{G}}+ {\boldsymbol{X}}^{-1} {\boldsymbol{u}}{\boldsymbol{u}}^{\intercal})\end{align} where the inequality follows by plugging in the maximiser $${\boldsymbol{L}}= - {\boldsymbol{X}}^{-1} {\boldsymbol{u}}$$.

## Optimal tuning

Tuning $${\boldsymbol{X}}$$ optimally in $$\eqref{eq:regretbd}$$ results in regret at most \begin{align*} R_T^{{\boldsymbol{u}}} ~\le~ \min_{{\boldsymbol{X}}} ~ \frac{1}{2} \operatorname{tr}({\boldsymbol{X}}{\boldsymbol{G}}+ {\boldsymbol{X}}^{-1} {\boldsymbol{u}}{\boldsymbol{u}}^{\intercal}) &~=~ \sqrt{{\boldsymbol{u}}^{\intercal}{\boldsymbol{G}}{\boldsymbol{u}}} \end{align*} where the optimiser is $${\boldsymbol{X}}= \frac{ {\boldsymbol{u}}{\boldsymbol{u}}^{\intercal}}{ \sqrt{{\boldsymbol{u}}^{\intercal}{\boldsymbol{G}}{\boldsymbol{u}}} }$$. This makes perfect sense. If you want to compare to $${\boldsymbol{u}}$$, you should ignore all other directions and only play in the $${\boldsymbol{u}}$$ direction. The problem becomes one-dimensional. This is not quite satisfying, as we need to know $${\boldsymbol{u}}$$ (and not just its norm) to be able to tune in this way. In the next sections we discuss tuning with less severe dependence on $${\boldsymbol{u}}$$.

Starting from the regret bound $$\eqref{eq:regretbd}$$, we may use that $${\|{\boldsymbol{u}}\|} \le D$$ so that $${\boldsymbol{u}}{\boldsymbol{u}}^{\intercal}\preceq D^2 {\boldsymbol{I}}$$ and hence the regret is bounded by $R_T^{{\boldsymbol{u}}} ~\le~ \frac{1}{2} \operatorname{tr}({\boldsymbol{X}}{\boldsymbol{G}}+ {\boldsymbol{X}}^{-1} {\boldsymbol{u}}{\boldsymbol{u}}^{\intercal}) ~\le~ \frac{1}{2} \operatorname{tr}({\boldsymbol{X}}{\boldsymbol{G}}+ D^2 {\boldsymbol{X}}^{-1}) .$ Now tuning results in $R_T^{{\boldsymbol{u}}} ~\le~ \min_{{\boldsymbol{X}}}~ \frac{1}{2} \operatorname{tr}({\boldsymbol{X}}{\boldsymbol{G}}+ D^2 {\boldsymbol{X}}^{-1}) ~=~ D \operatorname{tr}{\big({\boldsymbol{G}}^{1/2}\big)}$ with optimiser $${\boldsymbol{X}}= D {\boldsymbol{G}}^{-1/2}$$. As before, this optimal value for $${\boldsymbol{X}}$$ is not known a priori. Yet AdaGrad is an actual algorithm that achieves the above bound up to a factor $$\sqrt{2}$$ (see (Duchi, Hazan, and Singer 2011 Corollary 11) or (Hazan 2016 Theorem 5.9)). But is this bound desirable? How else could we tune?

## Minimax tuning

In the previous section we used $${\boldsymbol{u}}{\boldsymbol{u}}^{\intercal}\preceq D^2 {\boldsymbol{I}}$$. This upper bound replaces a single eigenvalue $$D^2$$ with $$d$$ many eigenvalues $$D^2$$. So intuitively, it may result in suboptimal dependence on the dimension. To tighten the tuning, let’s take a minimax approach instead. That is, we tune $$\eqref{eq:regretbd}$$ as follows: \begin{align*} R_T^{{\boldsymbol{u}}} &~\le~ \min_{{\boldsymbol{X}}} \max_{{\boldsymbol{u}}: {\|{\boldsymbol{u}}\|} \le D}~ \frac{1}{2} \operatorname{tr}({\boldsymbol{X}}{\boldsymbol{G}}+ {\boldsymbol{X}}^{-1} {\boldsymbol{u}}{\boldsymbol{u}}^{\intercal}) \\ &~=~ \min_{{\boldsymbol{X}}}~ \frac{1}{2} {\left( \operatorname{tr}({\boldsymbol{X}}{\boldsymbol{G}}) + D^2 \lambda_{\max}({\boldsymbol{X}}^{-1}) \right)} \\ &~=~ \min_{\eta}~ \frac{1}{2} {\left( \eta \operatorname{tr}({\boldsymbol{G}}) + \frac{D^2}{\eta} \right)} \\ &~=~ D \sqrt{ \operatorname{tr}({\boldsymbol{G}})} \end{align*} where the third line is obtained by observing that the optimal matrix $${\boldsymbol{X}}$$ will have all eigenvalues equal, i.e. $${\boldsymbol{X}}= \eta {\boldsymbol{I}}$$ and the fourth line is obtained by plugging in the optimiser $$\eta = \frac{D}{\sqrt{\operatorname{tr}({\boldsymbol{G}})}}$$. Surprisingly, this is the Online Gradient Descent tuning $$\eqref{eq:GD.tune}$$ and its bound $$\eqref{eq:GD.bound}$$ with $${\|{\boldsymbol{u}}\|} \le D$$. Hence Online Gradient Descent (with doubling trick) achieves this bound up to a small constant factor.

# Discussion

Online Gradient Descent has regret of order $$D \sqrt{\operatorname{tr}({\boldsymbol{G}})}$$ (e.g. using a doubling trick). Full AdaGrad has regret of order $$D \operatorname{tr}(\sqrt{{\boldsymbol{G}}})$$. The latter is always at least as high as the former (by concavity of the square root) and it may be a factor $$\sqrt{d}$$ higher (in case $${\boldsymbol{G}}\propto {\boldsymbol{I}}$$).

So what is the benefit of Full AdaGrad over Online Gradient Descent? For linear losses (the simplest case) I do not know. It may exist. But I have not found any regret analysis bringing it out. If you do, please let me know. I would be very interested.

Duchi, John, Elad Hazan, and Yoram Singer. 2011. “Adaptive Subgradient Methods for Online Learning and Stochastic Optimization.” Journal of Machine Learning Research 12: 2121–59. http://www.jmlr.org/papers/volume12/duchi11a/duchi11a.pdf.

Hazan, Elad. 2016. “Introduction to Online Convex Optimization.” Foundations and Trends in Optimization 2 (3–4). Now Publishers Inc.: 157–325. doi:10.1561/2400000013.

Orabona, Francesco, and Dávid Pál. 2016. “Scale-Free Online Learning.” CoRR abs/1601.01974. http://arxiv.org/abs/1601.01974.