2018-11-13
In this post we revisit the AdaHedge algorithm (Rooij et al. 2014). We are wondering about its behaviour on stochastic IID data. This question came up during a discussion with Tim van Erven at CWI in 2015, and has since then resulted in the paper (Koolen, Grünwald, and Van Erven 2016). This post is a self-contained application of its main result, namely that adaptive methods of a certain type (AdaHedge included) all have strongly reduced excess risk (aka pseudo-regret) on friendly stochastic data.
We are working in the Hedge setting with \(K\) actions. Learning proceeds in rounds. In each round \(t\) the learner assigns probabilities \((w_t^1, \ldots, w_t^K)\) to the \(K\) actions, the environment reveals the losses \((\ell_t^1, \ldots, \ell_t^K)\) of the actions and the learner incurs the average loss \(\sum_{k=1}^K w_t^k \ell_t^k\). The goal of the learner is to minimise the excess risk, which is defined after \(T\) rounds by \[R_T ~=~ \underbrace{ \sum_{t=1}^T \sum_{k=1}^K w_t^k \ell_t^k }_\text{learner loss} - \underbrace{ \min_k \sum_{t=1}^T \ell_t^k }_\text{best action loss} .\]
The Hedge — or exponential weights — strategy predicts in round \(t\) with distribution \[w_t^k ~=~ \frac{ e^{- \eta_t L_{t-1}^k} }{ \sum_{k=1}^K e^{- \eta_t L_{t-1}^k} } ,\] where \(L_T^k = \sum_{t=1}^T \ell_t^k\) is the cumulative loss of expert \(k\), and \(\eta_t\) is a parameter called the learning rate.
AdaHedge is an adaptive scheme for tuning the learning rate. It tunes the learning rate as follows \begin{equation}\label{eq:ah} \eta_t ~=~ \frac{\ln K}{\Delta_{t-1}}, \end{equation} where \(\Delta_T = \sum_{t=1}^T \delta_t\) accumulates the mixability gap, which is defined as \[\delta_t ~=~ \underbrace{ \sum_{k=1}^K w_t^k \ell_t^k }_\text{Hedge loss} - \underbrace{ \frac{-1}{\eta_t} \ln \sum_{k=1}^K w_t^k e^{-\eta_t \ell_t^k} }_\text{$\eta_t$-mix loss} .\] We see in particular that \(\eta_t\) is decreasing in \(t\), starting at \(\eta_1 = \infty\). This tuning ensures that the regret of AdaHedge satisfies (Rooij et al. 2014 Lemma 3) \begin{equation}\label{eq:ah.regret} R_T ~\le~ 2 \Delta_T . \end{equation} From this, one can derive the typical worst-case regret bound (Rooij et al. 2014 Corollary 9) \[R_T ~\le~ \sqrt{T \ln K} + \text{small} .\]
We consider a particularly easy learning setting. Loss vectors \((\ell_t^1, \ldots, \ell_t^K) \in [0,1]^K\) are generated IID. Let \(k^*\) be the index of the action with minimal expected loss (the Bayes act). We assume what we call the Massart condition (a simple instance of the Bernstein condition which is also related to the Tsybakov margin condition). The condition posits that there is a \(c > 0\) such that for all actions \(k = 1, \ldots, K\) \begin{equation}\label{eq:massart} \operatorname*{\mathbb E}[ (\ell^k - \ell^{k^*})^2] ~\le~ c \operatorname*{\mathbb E}[ \ell^k - \ell^{k^*}] . \end{equation} (we omitted time indices since the loss vectores are IID.) This condition allows us to relate the expected squared excess loss to the expected excess loss.
In this post we show that when data are generated from a Massart distribution, the excess risk of AdaHedge is constant. The excess risk is defined as the expected loss difference with the best action in expectation, i.e. \[R_T^* ~=~ \operatorname*{\mathbb E} \underbrace{ \sum_{t=1}^T \sum_{k=1}^K w_t^k \ell_t^k }_\text{learner loss} - \operatorname*{\mathbb E}\underbrace{ \sum_{t=1}^T \ell_t^{k^*} }_\text{loss of $k^*$}\] By Jensen \(R_T^* \le \operatorname*{\mathbb E}R_T\), i.e. the excess risk is at most the expected regret. To show that the excess risk is bounded, starting from \(\eqref{eq:ah.regret}\), it is key to control \(\operatorname*{\mathbb E}\delta_t\). The first step is to bound \[\delta_t ~\le~ \frac{e^{\eta_t} - \eta_t - 1}{\eta_t} \sum_{k=1}^K w_t^k (\ell_t^k - \ell_t^{k^*})^2\] as follows from using the suboptimal choice \(\lambda = \ell_t^{k^*}\) in the proof of (Koolen, Erven, and Grünwald 2014 Lemma C.3), see also my post on mixability gap bounds, Proposition 8. Taking the expectation and using the Massart condition \(\eqref{eq:massart}\) we find \[\operatorname*{\mathbb E}\delta_t ~\le~ \frac{e^{\eta_t} - \eta_t - 1}{\eta_t} c \sum_{k=1}^K w_t^k \operatorname*{\mathbb E}[\ell_t^k - \ell_t^{k^*}]\] To analyse the regret, it is convenient to divide the rounds in two types. Fix some \(\eta \ge 0\) (which we will optimise below), and let \(s\) be the first round such that \(\eta_s \le \eta\). Then by the definition \(\eqref{eq:ah}\) of AdaHedge, \(\Delta_{s-2} < \frac{\ln K}{\eta}\) and so \(\Delta_{s-1} < 1 + \frac{\ln K}{\eta}\). Then starting from the regret bound \(\eqref{eq:ah.regret}\) we find \[R_T^* ~\le~ 2 \sum_{t=1}^T \operatorname*{\mathbb E}\delta_t ~=~ 2 \sum_{t=1}^{s-1} \operatorname*{\mathbb E}\delta_t + 2 \sum_{t=s}^T \operatorname*{\mathbb E}\delta_t ~\le~ 2 + 2 \frac{\ln K}{\eta} + 2 \sum_{t=s}^T \operatorname*{\mathbb E}\delta_t .\] And finally, using \(\eta_t \le \eta\) for all \(t \ge s\) and the fact that \(\frac{e^x-x-1}{x}\) is increasing, we obtain \begin{align*} \sum_{t=s}^T \operatorname*{\mathbb E}\delta_t &~\le~ \sum_{t=s}^T \frac{e^{\eta_t} - \eta_t - 1}{\eta_t} \sum_{k=1}^K w_t^k c \operatorname*{\mathbb E}[\ell_t^k - \ell_t^{k^*}] \\ &~\le~ \frac{e^{\eta} - \eta - 1}{\eta} \sum_{t=s}^T \sum_{k=1}^K w_t^k c \operatorname*{\mathbb E}[\ell_t^k - \ell_t^{k^*}] \\ &~\le~ \frac{e^{\eta} - \eta - 1}{\eta} \sum_{t=1}^T \sum_{k=1}^K w_t^k c \operatorname*{\mathbb E}[\ell_t^k - \ell_t^{k^*}] \\ &~=~ \frac{e^{\eta} - \eta - 1}{\eta} R_T^* \end{align*} Putting everything together, we find \[R_T^* ~\le~ 2 + 2 \frac{\ln K}{\eta} + 2 \frac{e^{\eta} - \eta - 1}{\eta} c R_T^k,\] which we may reorganise to \[\left(1 - 2 \frac{e^{\eta} - \eta - 1}{\eta} c \right) R_T^* ~\le~ 2 + 2 \frac{\ln K }{\eta} .\] So if we pick \(\eta\) sufficiently small (such that \(1 \ge 2 \frac{e^{\eta} - \eta - 1}{\eta} c\), so roughly \(\eta \le \frac{1}{c}\)), we get \[R_T^* ~\le~ \frac{\eta + \ln K}{\eta/2 - c (e^{\eta} - \eta - 1)} .\] It is tricky to determine the optimal \(\eta\). A good candidate is found by optimisation without the first \(\eta\), effectively optimising the factor multiplying \(\ln K\). The optimal choice for that, found by differentiation, is \[\eta ~=~ \ln \left(1 + \frac{1}{2 c}\right) ,\] and the regret bound becomes \[R_T^* ~\le~ \frac{2 \ln K + 2 \ln \left(1+\frac{1}{2 c}\right)}{(2 c+1) \ln \left(1+\frac{1}{2 c}\right)-1} .\]
AdaHedge has constant excess risk under the Massart condition. And, simultaneously, it has a nice worst-case regret bound.
Koolen, Wouter M., Tim van Erven, and Peter D. Grünwald. 2014. “Learning the Learning Rate for Prediction with Expert Advice.” In Advances in Neural Information Processing Systems (Nips) 27, edited by Z. Ghahramani, M. Welling, C. Cortes, N.D. Lawrence, and K.Q. Weinberger, 2294–2302. http://papers.nips.cc/paper/5241-learning-the-learning-rate-for-prediction-with-expert-advice.pdf.
Koolen, Wouter M., Peter D. Grünwald, and Tim van Erven. 2016. “Combining Adversarial Guarantees and Stochastic Fast Rates in Online Learning.” In Advances in Neural Information Processing Systems (Nips) 29, edited by D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, 4457–65.
Rooij, Steven de, Tim van Erven, Peter D. Grünwald, and Wouter M. Koolen. 2014. “Follow the Leader If You Can, Hedge If You Must.” Journal of Machine Learning Research 15 (April): 1281–1316. http://jmlr.org/papers/volume15/rooij14a/rooij14a.pdf.