2017-01-21

Abstract

AdaGrad, an adaptive version of Online Gradient Descent, is often used in its diagonal version. In this post we take a look at the full matrix version to understand the benefit of its additional adaptive power.

# Setting

To keep things simple and focused, I will work in the setting of online linear optimisation without projections. Each round $$t$$ the learner plays a point $$\boldsymbol{w}_t \in \mathbb R^d$$, the environment plays a loss coefficient vector $$\boldsymbol{g}_t \in \mathbb R^d$$ and the learner incurs loss $$\boldsymbol{w}_t^\intercal\boldsymbol{g}_t$$. The goal of the learner is to minimise the cumulative regret $R_T^{\boldsymbol{u}} ~=~ \sum_{t=1}^T (\boldsymbol{w}_t - \boldsymbol{u})^\intercal\boldsymbol{g}_t$ compared to any point $$\boldsymbol{u}$$ in some pre-specified convex domain. In this post we will consider a simple yet typical domain, namely the Euclidean ball of radius $$D$$, i.e. $$\|\boldsymbol{u}\| \le D$$.

Before we discuss AdaGrad, let us first review a simpler algorithm for online optimisation, namely Follow the Regularised Leader (FTRL) with squared Euclidean regularisation. After $$t$$ rounds, it plays \begin{equation}\label{eq:ftrl} \boldsymbol{w}_{t+1} ~=~ \mathop{\mathrm{argmin}}_{\boldsymbol{w}}~ \sum_{s=1}^t \boldsymbol{w}^\intercal\boldsymbol{g}_s + \frac{\|\boldsymbol{w}\|^2}{2 \eta} , \end{equation} or, equivalently, it performs the sequential update $\boldsymbol{w}_{t+1} ~=~ \boldsymbol{w}_t - \eta \boldsymbol{g}_t,$ which is called Online Gradient Descent (without projections and with a fixed learning rate $$\eta$$, FTRL and OGD are identical). The learning rate parameter $$\eta$$ needs to be tuned carefully to get a good bound. It can be shown that the scheme $$\eqref{eq:ftrl}$$ guarantees $R_T^{\boldsymbol{u}} ~\le~ \frac{\|\boldsymbol{u}\|^2}{2 \eta} + \frac{\eta}{2} \sum_{t=1}^T \|\boldsymbol{g}_t\|^2 \qquad \text{for all } \boldsymbol{u}.$ The right-hand side is optimised in the learning rate $$\eta$$ at \begin{equation} \label{eq:GD.tune} \eta ~=~ \frac{ \|\boldsymbol{u}\| }{ \sqrt{ \sum_{t=1}^T \|\boldsymbol{g}_t\|^2 } }, \end{equation} which results in the bound \begin{equation} \label{eq:GD.bound} R_T^{\boldsymbol{u}} ~\le~ \|\boldsymbol{u}\| \sqrt{ \sum_{t=1}^T \|\boldsymbol{g}_t\|^2 } . \end{equation} In practice this tuning $$\eqref{eq:GD.tune}$$ is inaccessible, as we have no prior knowledge of $$\|\boldsymbol{u}\|$$ or $$\sum_{t=1}^T \|\boldsymbol{g}_t\|^2$$. But actual algorithms can get pretty close, for example by tuning for a known upper bound $$\|\boldsymbol{u}\| \le D$$ and by using the doubling trick on $$\sum_{t=1}^T \|\boldsymbol{g}_t\|^2$$. For smoother methods see e.g. (Orabona and Pál 2016).
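To make the guarantee concrete, here is a small numpy sketch of unprojected Online Gradient Descent on random linear losses; the horizon, dimension, learning rate and loss vectors are arbitrary illustrative choices. It checks that the regret against a random comparator indeed stays below the bound above.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d, eta = 100, 5, 0.1
gs = rng.standard_normal((T, d))  # loss coefficient vectors g_1, ..., g_T

# Online Gradient Descent without projections: w_{t+1} = w_t - eta * g_t.
w = np.zeros(d)
loss = 0.0
for g in gs:
    loss += w @ g  # the learner plays w_t and incurs w_t^T g_t
    w -= eta * g

# Regret against a fixed comparator u, and the FTRL guarantee from the text.
u = rng.standard_normal(d)
regret = loss - gs.sum(axis=0) @ u
bound = u @ u / (2 * eta) + (eta / 2) * (gs ** 2).sum()
assert regret <= bound
```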

The idea of AdaGrad (see (Hazan 2016, sec. 5.6)) is that the regulariser $$\frac{1}{2} \|\boldsymbol{w}\|^2$$ in $$\eqref{eq:ftrl}$$ may not be the best choice for the problem at hand. So let’s replace the scaled Euclidean norm $$\eta^{-1} \|\boldsymbol{w}\|^2 = \eta^{-1} \boldsymbol{w}^\intercal\boldsymbol{w}$$ by the more general quadratic form $$\|\boldsymbol{w}\|_{\boldsymbol{X}}^2 = \boldsymbol{w}^\intercal\boldsymbol{X}^{-1} \boldsymbol{w}$$, where $$\boldsymbol{X}\succeq \boldsymbol{0}$$ is now a PSD matrix parameter. Online Gradient Descent is recovered by setting $$\boldsymbol{X}= \eta \boldsymbol{I}$$. With this upgrade, the learner plays \begin{equation}\label{eq:BentDescent} \boldsymbol{w}_{t+1} ~=~ \mathop{\mathrm{argmin}}_{\boldsymbol{w}}~ \sum_{s=1}^t \boldsymbol{g}_s^\intercal\boldsymbol{w}+ \frac{1}{2} \boldsymbol{w}^\intercal\boldsymbol{X}^{-1} \boldsymbol{w} ~=~ - \boldsymbol{X}\sum_{s=1}^t \boldsymbol{g}_s \end{equation} or, equivalently, performs the sequential update $\boldsymbol{w}_{t+1} ~=~ \boldsymbol{w}_t - \boldsymbol{X}\boldsymbol{g}_t .$ To see how we should tune $$\boldsymbol{X}$$, let us analyse the regret of this scheme.
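As a quick sanity check on the closed form, a numpy sketch (with an arbitrary positive definite $$\boldsymbol{X}$$) confirming that the sequential update reproduces the FTRL solution:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4
A = rng.standard_normal((d, d))
X = A @ A.T + 0.1 * np.eye(d)  # an arbitrary positive definite X

gs = rng.standard_normal((10, d))

# Sequential update w_{t+1} = w_t - X g_t, starting from w_1 = 0 ...
w = np.zeros(d)
for g in gs:
    w = w - X @ g

# ... matches the FTRL closed form w_{t+1} = -X sum_s g_s.
assert np.allclose(w, -X @ gs.sum(axis=0))
```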

## Analysis

Let us introduce convenient abbreviations for the cumulative loss vector $$\boldsymbol{L}= \sum_{t=1}^T \boldsymbol{g}_t$$ and outer product matrix $$\boldsymbol{G}= \sum_{t=1}^T \boldsymbol{g}_t \boldsymbol{g}_t^\intercal$$. For any $$\boldsymbol{X}$$, the loss of the scheme $$\eqref{eq:BentDescent}$$ is $\sum_{t=1}^T \boldsymbol{w}_t^\intercal\boldsymbol{g}_t ~=~ - \sum_{t=1}^T \sum_{s=1}^{t-1} \boldsymbol{g}_s^\intercal\boldsymbol{X}\boldsymbol{g}_t ~=~ \frac{1}{2} \sum_{t=1}^T \boldsymbol{g}_t^\intercal\boldsymbol{X}\boldsymbol{g}_t - \frac{1}{2} \boldsymbol{L}^\intercal\boldsymbol{X}\boldsymbol{L} ~=~ \frac{1}{2} \mathop{\mathrm{tr}}\left(\boldsymbol{X}(\boldsymbol{G}- \boldsymbol{L}\boldsymbol{L}^\intercal)\right)$ and hence the regret compared to any point $$\boldsymbol{u}$$ is bounded by \begin{align} \notag R_T^{\boldsymbol{u}} ~=~ \sum_{t=1}^T (\boldsymbol{w}_t - \boldsymbol{u})^\intercal\boldsymbol{g}_t &~=~ \frac{1}{2} \mathop{\mathrm{tr}}(\boldsymbol{X}\boldsymbol{G}) - \left(\boldsymbol{u}^\intercal+ \frac{1}{2} \boldsymbol{L}^\intercal\boldsymbol{X}\right) \boldsymbol{L} \\ \label{eq:regretbd} &~\le~ \frac{1}{2} \mathop{\mathrm{tr}}(\boldsymbol{X}\boldsymbol{G}+ \boldsymbol{X}^{-1} \boldsymbol{u}\boldsymbol{u}^\intercal)\end{align} where the inequality follows by plugging in the maximiser $$\boldsymbol{L}= - \boldsymbol{X}^{-1} \boldsymbol{u}$$.
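The loss identity above is easy to verify numerically; a sketch with arbitrary random losses and an arbitrary positive definite $$\boldsymbol{X}$$:

```python
import numpy as np

rng = np.random.default_rng(2)
d, T = 3, 50
A = rng.standard_normal((d, d))
X = A @ A.T + np.eye(d)  # an arbitrary positive definite (hence symmetric) X
gs = rng.standard_normal((T, d))

# Run the scheme w_{t+1} = w_t - X g_t and accumulate its loss.
w = np.zeros(d)
loss = 0.0
for g in gs:
    loss += w @ g
    w = w - X @ g

L = gs.sum(axis=0)   # cumulative loss vector
G = gs.T @ gs        # outer product matrix sum_t g_t g_t^T

# The closed form from the text: loss = 1/2 tr(X (G - L L^T)).
assert np.isclose(loss, 0.5 * np.trace(X @ (G - np.outer(L, L))))
```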

## Optimal tuning

Tuning $$\boldsymbol{X}$$ optimally in $$\eqref{eq:regretbd}$$ results in regret at most \begin{align*} R_T^{\boldsymbol{u}} ~\le~ \min_{\boldsymbol{X}} ~ \frac{1}{2} \mathop{\mathrm{tr}}(\boldsymbol{X}\boldsymbol{G}+ \boldsymbol{X}^{-1} \boldsymbol{u}\boldsymbol{u}^\intercal) ~=~ \sqrt{\boldsymbol{u}^\intercal\boldsymbol{G}\boldsymbol{u}} \end{align*} where the optimiser is $$\boldsymbol{X}= \frac{ \boldsymbol{u}\boldsymbol{u}^\intercal }{ \sqrt{\boldsymbol{u}^\intercal\boldsymbol{G}\boldsymbol{u}} }$$ (a rank-one matrix, so the inverse in the bound should be read as a pseudo-inverse, or as a limit of full-rank minimisers). This makes perfect sense: if you want to compare to $$\boldsymbol{u}$$, you should ignore all other directions and only play in the $$\boldsymbol{u}$$ direction, so the problem becomes one-dimensional. This tuning is not quite satisfying, however, as we need to know $$\boldsymbol{u}$$ itself (and not just its norm). In the next sections we discuss tunings with less severe dependence on $$\boldsymbol{u}$$.

Starting from the regret bound $$\eqref{eq:regretbd}$$, we may use that $$\|\boldsymbol{u}\| \le D$$ so that $$\boldsymbol{u}\boldsymbol{u}^\intercal\preceq D^2 \boldsymbol{I}$$ and hence the regret is bounded by $R_T^{\boldsymbol{u}} ~\le~ \frac{1}{2} \mathop{\mathrm{tr}}(\boldsymbol{X}\boldsymbol{G}+ \boldsymbol{X}^{-1} \boldsymbol{u}\boldsymbol{u}^\intercal) ~\le~ \frac{1}{2} \mathop{\mathrm{tr}}(\boldsymbol{X}\boldsymbol{G}+ D^2 \boldsymbol{X}^{-1}) .$ Now tuning results in $R_T^{\boldsymbol{u}} ~\le~ \min_{\boldsymbol{X}}~ \frac{1}{2} \mathop{\mathrm{tr}}(\boldsymbol{X}\boldsymbol{G}+ D^2 \boldsymbol{X}^{-1}) ~=~ D \mathop{\mathrm{tr}}\big(\boldsymbol{G}^{1/2}\big)$ with optimiser $$\boldsymbol{X} = D \boldsymbol{G}^{-1/2}$$. As before, this optimal value for $$\boldsymbol{X}$$ is not known a priori. Yet AdaGrad is an actual algorithm that achieves the above bound up to a factor $$\sqrt{2}$$ (see (Duchi, Hazan, and Singer 2011, Corollary 11) or (Hazan 2016, Theorem 5.9)). But is this bound desirable? How else could we tune?
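The optimiser $$\boldsymbol{X} = D \boldsymbol{G}^{-1/2}$$ and the value $$D \mathop{\mathrm{tr}}\big(\boldsymbol{G}^{1/2}\big)$$ can be checked numerically; in this sketch $$\boldsymbol{G}$$ is an arbitrary random positive definite matrix, and a handful of random positive definite alternatives are confirmed to do no better:

```python
import numpy as np

rng = np.random.default_rng(3)
d, D = 4, 1.5
B = rng.standard_normal((d, 20))
G = B @ B.T  # an arbitrary positive definite G

# Matrix square root of G via its eigendecomposition.
lam, V = np.linalg.eigh(G)
G_sqrt = V @ np.diag(np.sqrt(lam)) @ V.T

def bound(X):
    # The objective 1/2 tr(X G + D^2 X^{-1}) from the text.
    return 0.5 * np.trace(X @ G + D**2 * np.linalg.inv(X))

# At X = D G^{-1/2} the objective equals D tr(G^{1/2}) ...
X_opt = D * np.linalg.inv(G_sqrt)
assert np.isclose(bound(X_opt), D * np.trace(G_sqrt))

# ... and random positive definite alternatives can only do worse.
for _ in range(20):
    A = rng.standard_normal((d, d))
    X = A @ A.T + 0.01 * np.eye(d)
    assert bound(X) >= bound(X_opt) - 1e-9
```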

## Minimax tuning

In the previous section we used $$\boldsymbol{u}\boldsymbol{u}^\intercal\preceq D^2 \boldsymbol{I}$$. This upper bound replaces a single eigenvalue $$D^2$$ by $$d$$ eigenvalues equal to $$D^2$$, so intuitively it may result in a suboptimal dependence on the dimension. To tighten the tuning, let’s take a minimax approach instead. That is, we tune $$\eqref{eq:regretbd}$$ as follows: \begin{align*} R_T^{\boldsymbol{u}} &~\le~ \min_{\boldsymbol{X}} \max_{\boldsymbol{u}: \|\boldsymbol{u}\| \le D}~ \frac{1}{2} \mathop{\mathrm{tr}}(\boldsymbol{X}\boldsymbol{G}+ \boldsymbol{X}^{-1} \boldsymbol{u}\boldsymbol{u}^\intercal) \\ &~=~ \min_{\boldsymbol{X}}~ \frac{1}{2} \left( \mathop{\mathrm{tr}}(\boldsymbol{X}\boldsymbol{G}) + D^2 \lambda_{\max}(\boldsymbol{X}^{-1}) \right) \\ &~=~ \min_{\eta}~ \frac{1}{2} \left( \eta \mathop{\mathrm{tr}}(\boldsymbol{G}) + \frac{D^2}{\eta} \right) \\ &~=~ D \sqrt{ \mathop{\mathrm{tr}}(\boldsymbol{G})} \end{align*} where the third line holds because the optimal $$\boldsymbol{X}$$ has all eigenvalues equal: fixing the smallest eigenvalue $$\eta$$ of $$\boldsymbol{X}$$ fixes $$\lambda_{\max}(\boldsymbol{X}^{-1}) = 1/\eta$$, and subject to $$\boldsymbol{X}\succeq \eta \boldsymbol{I}$$ the term $$\mathop{\mathrm{tr}}(\boldsymbol{X}\boldsymbol{G})$$ is minimised at $$\boldsymbol{X}= \eta \boldsymbol{I}$$. The fourth line is obtained by plugging in the optimiser $$\eta = \frac{D}{\sqrt{\mathop{\mathrm{tr}}(\boldsymbol{G})}}$$. Surprisingly, this is exactly the Online Gradient Descent tuning $$\eqref{eq:GD.tune}$$ and its bound $$\eqref{eq:GD.bound}$$ with $$\|\boldsymbol{u}\| \le D$$. Hence Online Gradient Descent (with the doubling trick) achieves this bound up to a small constant factor.
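The second line above uses that the inner maximum over the ball picks out the top eigendirection of $$\boldsymbol{X}^{-1}$$; a short numerical check with an arbitrary positive definite matrix:

```python
import numpy as np

rng = np.random.default_rng(4)
d, D = 4, 2.0
A = rng.standard_normal((d, d))
Xinv = A @ A.T + np.eye(d)  # plays the role of X^{-1}

# max_{||u|| <= D} u^T X^{-1} u is attained at D times the top eigenvector ...
lam, V = np.linalg.eigh(Xinv)  # eigenvalues in ascending order
u_star = D * V[:, -1]
assert np.isclose(u_star @ Xinv @ u_star, D**2 * lam[-1])

# ... and no other feasible u does better.
for _ in range(100):
    u = rng.standard_normal(d)
    u *= D / np.linalg.norm(u)
    assert u @ Xinv @ u <= D**2 * lam[-1] + 1e-9
```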

# Discussion

Online Gradient Descent has regret of order $$D \sqrt{\mathop{\mathrm{tr}}(\boldsymbol{G})}$$ (e.g. using a doubling trick). Full AdaGrad has regret of order $$D \mathop{\mathrm{tr}}(\sqrt{\boldsymbol{G}})$$. The latter is always at least as high as the former, since $$\sum_i \sqrt{\lambda_i} \ge \sqrt{\sum_i \lambda_i}$$ for the eigenvalues $$\lambda_i \ge 0$$ of $$\boldsymbol{G}$$, and it may be a factor $$\sqrt{d}$$ higher (in case $$\boldsymbol{G}\propto \boldsymbol{I}$$).
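Both regimes are easy to exhibit numerically (with $$D = 1$$ and an arbitrary dimension):

```python
import numpy as np

d = 16

def ogd_bound(G):
    # D * sqrt(tr G), with D = 1.
    return np.sqrt(np.trace(G))

def adagrad_bound(G):
    # D * tr(G^{1/2}), with D = 1: the sum of square roots of the eigenvalues.
    lam = np.clip(np.linalg.eigvalsh(G), 0.0, None)  # clip tiny negatives
    return np.sum(np.sqrt(lam))

# Isotropic case G = I: the AdaGrad bound is a factor sqrt(d) larger.
G = np.eye(d)
assert np.isclose(adagrad_bound(G) / ogd_bound(G), np.sqrt(d))

# Rank-one case G = g g^T (all loss vectors aligned): the two bounds coincide.
g = np.arange(1.0, d + 1)
G = np.outer(g, g)
assert np.isclose(adagrad_bound(G), ogd_bound(G))
```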

So what is the benefit of Full AdaGrad over Online Gradient Descent? For linear losses (the simplest case) I do not know. It may exist. But I have not found any regret analysis bringing it out. If you do, please let me know. I would be very interested.

Duchi, John, Elad Hazan, and Yoram Singer. 2011. “Adaptive Subgradient Methods for Online Learning and Stochastic Optimization.” Journal of Machine Learning Research 12: 2121–59. http://www.jmlr.org/papers/volume12/duchi11a/duchi11a.pdf.
Hazan, Elad. 2016. “Introduction to Online Convex Optimization.” Foundations and Trends in Optimization 2 (3–4): 157–325. https://doi.org/10.1561/2400000013.
Orabona, Francesco, and Dávid Pál. 2016. “Scale-Free Online Learning.” CoRR abs/1601.01974. http://arxiv.org/abs/1601.01974.