2016-02-21

Abstract

We regard gradient descent as an instance of exponential weights.

We start by reviewing two common strategies in online learning: gradient descent and exponential weights. We then see that gradient descent can be regarded as an instance of exponential weights. We conclude by discussing why that viewpoint is useful.

Prediction proceeds in rounds. In round \(t\) the learner produces a prediction \(\boldsymbol x_t \in \mathbb R^d\), encounters a loss vector \(\boldsymbol \ell_t \in \mathbb R^d\), and incurs loss \(\boldsymbol x_t^\intercal\boldsymbol \ell_t\) given by the dot product (we are sweeping the gradient trick under the carpet here). We first look at two strategies to choose the predictions \(\boldsymbol x_t\). For the purpose of this post we take a fixed learning rate \(\eta > 0\).

Our first method of interest, (online) *gradient descent*, is typically motivated by the optimisation problem \begin{equation}\label{eq:gd.opt}
\boldsymbol x_{T+1} ~=~ \mathop{\mathrm{argmin}}_{\boldsymbol x}~ \|\boldsymbol x- \boldsymbol x_1\|^2 + \eta \sum_{t=1}^T \boldsymbol x^\intercal\boldsymbol \ell_t
,
\end{equation} where the initial point \(\boldsymbol x_1\) is usually set to \(\boldsymbol x_1 = \boldsymbol 0\). The solution is \begin{equation}\label{eq:gd.weigths}
\boldsymbol x_{T+1}
~=~
\boldsymbol x_1 - \eta \sum_{t=1}^T \boldsymbol \ell_t
\quad
\text{or, incrementally,}
\quad
\boldsymbol x_{T+1}
~=~
\boldsymbol x_T - \eta \boldsymbol \ell_T
.
\end{equation}

The second method, *exponential weights*, produces probability distributions, and we will denote these by \(\boldsymbol w_t\) instead of \(\boldsymbol x_t\). Exponential weights is motivated by a similar optimisation problem \begin{equation}\label{eq:ew.opt}
\boldsymbol w_{T+1} ~=~ \mathop{\mathrm{argmin}}_{\boldsymbol w}~ \mathop{\mathrm{KL}}(\boldsymbol w\|\boldsymbol w_1) + \eta \sum_{t=1}^T \boldsymbol w^\intercal\boldsymbol \ell_t
,
\end{equation} now with the Kullback-Leibler divergence between distributions taking the place of the squared Euclidean norm between points in \(\eqref{eq:gd.opt}\). The initial point \(\boldsymbol w_1\) is commonly set to the uniform distribution. The closed-form solution is \begin{equation}\label{eq:eq}
w_{T+1}^k ~=~ \frac{w_1^k e^{-\eta \sum_{t=1}^T \ell_t^k}}{\sum_k w_1^k e^{-\eta \sum_{t=1}^T \ell_t^k}}
\quad
\text{or, incrementally,}
\quad
w_{T+1}^k ~=~ \frac{w_T^k e^{-\eta \ell_T^k}}{\sum_k w_T^k e^{-\eta \ell_T^k}}
.
\end{equation}

The main point of this post is to render gradient descent as an instance of exponential weights. This will work as follows. We presented exponential weights for a finite set of dimensions or “experts”. Here we generalise that slightly. We will consider all points in \(\mathbb R^d\) as experts, and maintain weights on these in the form of a density. We start out with a Gaussian prior density, and show that each subsequent “posterior” distribution is again Gaussian. Moreover, and this is the crux, we will see that the resulting posterior mean evolves exactly according to the GD update equation \(\eqref{eq:gd.weigths}\).

Here is the detailed version. We take as our space the set \(\mathbb R^d\). We take as our prior \(w_1\) a multivariate Gaussian density with mean \(\boldsymbol x_1\) and identity covariance \[w_1(\boldsymbol x)
~=~
(2 \pi)^{-d/2} e^{-\frac{1}{2} \|\boldsymbol x-\boldsymbol x_1\|^2}
.\] Following \(\eqref{eq:eq}\), the exponential weights update results in new weights \[w_{T+1}(\boldsymbol x)
~=~
\frac{
w_1(\boldsymbol x) e^{-\eta \sum_{t=1}^T \boldsymbol x^\intercal\boldsymbol \ell_t}
}{
\int
w_1(\boldsymbol x) e^{-\eta \sum_{t=1}^T \boldsymbol x^\intercal\boldsymbol \ell_t}
\text{d} \boldsymbol x
}
~=~
(2 \pi)^{-d/2} e^{-\frac{1}{2} \left\|\boldsymbol x-\boldsymbol x_1 + \eta \sum_{t=1}^T \boldsymbol \ell_t\right\|^2}
,\] as can be seen by completing the square. Hence \(w_{T+1}\) is again a multivariate Gaussian with identity covariance and mean \[\boldsymbol \mu_{T+1} ~=~ \boldsymbol x_1 - \eta \sum_{t=1}^T \boldsymbol \ell_t
.\] So indeed exponential weights yields a Gaussian with mean equal to GD \(\eqref{eq:gd.weigths}\). That is, one way to think about GD is as a *collapsed implementation* of exponential weights.

This perspective of gradient descent as an instance of exponential weights is not news by itself. Yet it provides a bridge to transport extensions developed in the exponential weights world to the gradient descent world. I am thinking about, for example, constructions involving *specialists* ((Freund et al. 1997), (Chernov and Vovk 2009), (Koolen, Adamskiy, and Warmuth 2012)) (this is the route taken by (Luo and Schapire 2015, sec. 5.1)), and methods to *learn the learning rate \(\eta\)* (which we considered a fixed constant in this post) (Koolen and Van Erven 2015).

This perspective also hints at a possible shortcoming of gradient descent: it does not learn the covariance. In fact, this is what more advanced methods like Online Newton Step by (Hazan, Agarwal, and Kale 2007) and AdaGrad by (Duchi, Hazan, and Singer 2011) do do.

Chernov, Alexey, and Vladimir Vovk. 2009. “Prediction with Expert Evaluators’ Advice.” In *ALT Proceedings*, 5809:8–22. Lecture Notes in Computer Science.

Duchi, John, Elad Hazan, and Yoram Singer. 2011. “Adaptive Subgradient Methods for Online Learning and Stochastic Optimization.” *Journal of Machine Learning Research* 12: 2121–59. http://www.jmlr.org/papers/volume12/duchi11a/duchi11a.pdf.

Freund, Yoav, Robert E. Schapire, Yoram Singer, and Manfred K. Warmuth. 1997. “Using and Combining Predictors That Specialize.” In *STOC Proceedings*, 334–43.

Hazan, Elad, Amit Agarwal, and Satyen Kale. 2007. “Logarithmic Regret Algorithms for Online Convex Optimization.” *Machine Learning* 69 (2-3): 169–92. http://dx.doi.org/10.1007/s10994-007-5016-8.

Koolen, Wouter M., Dmitry Adamskiy, and Manfred K. Warmuth. 2012. “Putting Bayes to Sleep.” In *Advances in Neural Information Processing Systems (NIPS) 25*, 135–43. http://papers.nips.cc/paper/4557-putting-bayes-to-sleep.pdf.

Koolen, Wouter M., and Tim van Erven. 2015. “Second-Order Quantile Methods for Experts and Combinatorial Games.” In *Proceedings of the 28th Annual Conference on Learning Theory (COLT)*, 1155–75. http://jmlr.org/proceedings/papers/v40/Koolen15a.pdf.

Luo, Haipeng, and Robert E. Schapire. 2015. “Achieving All with No Parameters: Adaptive NormalHedge.” In *Proceedings of the 28th Annual Conference on Learning Theory (COLT)*, edited by Peter Grünwald, Elad Hazan, and Satyen Kale, 1286–1304. http://jmlr.org/proceedings/papers/v40/Luo15.pdf.