We start by reviewing two common strategies in online learning: gradient descent and exponential weights. We then see that gradient descent can be regarded as an instance of exponential weights. We conclude by discussing why that viewpoint is useful.

Prediction proceeds in rounds. In round \(t\) the learner produces a prediction \({\boldsymbol x}_t \in {\mathbb R}^d\), encounters a loss vector \({\boldsymbol {\ell}}_t \in {\mathbb R}^d\), and incurs loss \({\boldsymbol x}_t^{\intercal}{\boldsymbol {\ell}}_t\) given by the dot product (we are sweeping the gradient trick under the carpet here). We first look at two strategies to choose the predictions \({\boldsymbol x}_t\). For the purpose of this post we take a fixed learning rate \(\eta > 0\).

Gradient Descent

Our first method of interest, (online) gradient descent, is typically motivated by the optimisation problem \begin{equation}\label{eq:gd.opt} {\boldsymbol x}_{T+1} ~=~ \operatorname*{argmin}_{{\boldsymbol x}}~ {\|{\boldsymbol x}- {\boldsymbol x}_1\|}^2 + \eta \sum_{t=1}^T {\boldsymbol x}^{\intercal}{\boldsymbol {\ell}}_t , \end{equation} where the initial point \({\boldsymbol x}_1\) is usually set to \({\boldsymbol x}_1 = {\boldsymbol 0}\). The solution is \begin{equation}\label{eq:gd.weigths} {\boldsymbol x}_{T+1} ~=~ {\boldsymbol x}_1 - \eta \sum_{t=1}^T {\boldsymbol {\ell}}_t \quad \text{or, incrementally,} \quad {\boldsymbol x}_{T+1} ~=~ {\boldsymbol x}_T - \eta {\boldsymbol {\ell}}_T . \end{equation}

Exponential Weigths

The second method, exponential weights, produces probability distributions, and we will denote these by \({\boldsymbol w}_t\) instead of \({\boldsymbol x}_t\). Exponential weights is motivated by a similar optimisation problem \begin{equation}\label{eq:ew.opt} {\boldsymbol w}_{T+1} ~=~ \operatorname*{argmin}_{{\boldsymbol w}}~ \operatorname{KL}({\boldsymbol w}\|{\boldsymbol w}_1) + \eta \sum_{t=1}^T {\boldsymbol w}^{\intercal}{\boldsymbol {\ell}}_t , \end{equation} now with the Kullback-Leibler divergence between distributions taking the place of the squared Euclidean norm between points in \(\eqref{eq:gd.opt}\). The initial point \({\boldsymbol w}_1\) is commonly set to the uniform distribution. The closed-form solution is \begin{equation}\label{eq:eq} w_{T+1}^k ~=~ \frac{w_1^k e^{-\eta \sum_{t=1}^T {\ell}_t^k}}{\sum_k w_1^k e^{-\eta \sum_{t=1}^T {\ell}_t^k}} \quad \text{or, incrementally,} \quad w_{T+1}^k ~=~ \frac{w_T^k e^{-\eta {\ell}_T^k}}{\sum_k w_T^k e^{-\eta {\ell}_T^k}} . \end{equation}


The main point of this post is to render gradient descent as an instance of exponential weights. This will work as follows. We presented exponential weights for a finite set of dimensions or “experts”. Here we generalise that slightly. We will consider all points in \({\mathbb R}^d\) as experts, and maintain weights on these in the form of a density. We start out with a Gaussian prior density, and show that each subsequent “posterior” distribution is again Gaussian. Moreover, and this is the crux, we will see that the resulting posterior mean evolves exactly according to the GD update equation \(\eqref{eq:gd.weigths}\).

Here is the detailed version. We take as our space the set \({\mathbb R}^d\). We take as our prior \(w_1\) a multivariate Gaussian density with mean \({\boldsymbol x}_1\) and identity covariance \[w_1({\boldsymbol x}) ~=~ (2 \pi)^{-d/2} e^{-\frac{1}{2} {\|{\boldsymbol x}-{\boldsymbol x}_1\|}^2} .\] Following \(\eqref{eq:eq}\), the exponential weights update results in new weights \[w_{T+1}({\boldsymbol x}) ~=~ \frac{ w_1({\boldsymbol x}) e^{-\eta \sum_{t=1}^T {\boldsymbol x}^{\intercal}{\boldsymbol {\ell}}_t} }{ \int w_1({\boldsymbol x}) e^{-\eta \sum_{t=1}^T {\boldsymbol x}^{\intercal}{\boldsymbol {\ell}}_t} \text{d} {\boldsymbol x}} ~=~ (2 \pi)^{-d/2} e^{-\frac{1}{2} {\left\|{\boldsymbol x}-{\boldsymbol x}_1 + \eta \sum_{t=1}^T {\boldsymbol {\ell}}_t\right\|}^2} ,\] as can be seen by completing the square. Hence \(w_{T+1}\) is again a multivariate Gaussian with identity covariance and mean \[{\boldsymbol \mu}_{T+1} ~=~ {\boldsymbol x}_1 - \eta \sum_{t=1}^T {\boldsymbol {\ell}}_t .\] So indeed exponential weights yields a Gaussian with mean equal to GD \(\eqref{eq:gd.weigths}\). That is, one way to think about GD is as a collapsed implementation of exponential weights.


This perspective of gradient descent as an instance of exponential weights is not news by itself. Yet it provides a bridge to transport extensions developed in the exponential weights world to the gradient descent world. I am thinking about, for example, constructions involving specialists (Freund et al. (1997), Chernov and Vovk (2009), Koolen, Adamskiy, and Warmuth (2012)) (this is the route taken by Luo and Schapire (2015 Section 5.1)), and methods to learn the learning rate \(\eta\) (which we considered a fixed constant in this post) Koolen and Van Erven (2015).

This perspective also hints at a possible shortcoming of gradient descent: it does not learn the covariance. In fact, this is what more advanced methods like Online Newton Step by Hazan, Agarwal, and Kale (2007) and AdaGrad by Duchi, Hazan, and Singer (2011) do do.

Chernov, Alexey, and Vladimir Vovk. 2009. “Prediction with Expert Evaluators’ Advice.” In ALT Proceedings, 5809:8–22. Lecture Notes in Computer Science.

Duchi, John, Elad Hazan, and Yoram Singer. 2011. “Adaptive Subgradient Methods for Online Learning and Stochastic Optimization.” Journal of Machine Learning Research 12: 2121–59.

Freund, Yoav, Robert E. Schapire, Yoram Singer, and Manfred K. Warmuth. 1997. “Using and Combining Predictors That Specialize.” In STOC Proceedings, 334–43.

Hazan, Elad, Amit Agarwal, and Satyen Kale. 2007. “Logarithmic Regret Algorithms for Online Convex Optimization.” Machine Learning 69 (2-3). Kluwer Academic Publishers, Boston: 169–92.

Koolen, Wouter M., Dmitry Adamskiy, and Manfred K. Warmuth. 2012. “Putting Bayes to Sleep.” In Advances in Neural Information Processing Systems (NIPS) 25, 135–43.

Koolen, Wouter M., and Tim van Erven. 2015. “Second-Order Quantile Methods for Experts and Combinatorial Games.” In Proceedings of the 28th Annual Conference on Learning Theory (COLT), 1155–75.

Luo, Haipeng, and Robert E. Schapire. 2015. “Achieving All with No Parameters: Adaptive NormalHedge.” In Proceedings of the 28th Annual Conference on Learning Theory (COLT), edited by Peter Grünwald, Elad Hazan, and Satyen Kale, 1286–1304.