2016-10-31
I recently encountered AdaFTRL by (Orabona and Pál 2016). This algorithm generalises both AdaHedge (De Rooij et al. 2014) which uses entropy regularisation and a version of AdaGrad (Duchi, Hazan, and Singer 2011) that operates with squared norm regularisation. In this post, in order for me to understand AdaFTRL better, I analyse it in the AdaHedge language.
We look at online linear optimisation over a compact convex set \(\mathcal U\subseteq \mathbb R^d\). Each round the learner plays a point \({\boldsymbol w}_t \in \mathcal U\). The environment then reveals the coefficient vector \({\boldsymbol \ell}_t\) of the linear loss function, and the learner incurs loss \(\left\langle{\boldsymbol w}_t,{\boldsymbol \ell}_t\right\rangle\). Over the course of \(T\) rounds, the learner accumulates loss \[H_T ~=~ \sum_{t=1}^T \left\langle{\boldsymbol w}_t,{\boldsymbol \ell}_t\right\rangle,\] while the best point in hindsight incurs \[L_T^* ~=~ \min_{{\boldsymbol u}\in \mathcal U}~ \sum_{t=1}^T \left\langle{\boldsymbol u},{\boldsymbol \ell}_t\right\rangle\] The objective of the learner is to minimise the regret \[%\text{Regret} %~=~ H_T - L_T^*.\] We would like to do this anytime, i.e. in a way that does this for all \(T\), worst-case safe, i.e. in a way that works even if the loss vectors \({\boldsymbol \ell}_t\) are selected by an adaptive adversary, and scale-free, i.e. in a way that does not assume anything about the scale of the sequence of losses \({\boldsymbol \ell}_1, {\boldsymbol \ell}_2, \ldots\) So even though the algorithm does not need to know the scale, it will appear in the regret. To quantify the scale we will make use of the effective loss range in round \(t\), which we define by \begin{equation}\label{eq:range} s_t ~=~ \max_{{\boldsymbol u}\in \mathcal U} \langle{\boldsymbol u},{\boldsymbol \ell}_t\rangle - \min_{{\boldsymbol u}\in \mathcal U} \langle{\boldsymbol u},{\boldsymbol \ell}_t\rangle , \end{equation} and the overall effective loss range, defined by by \[S_T ~=~ \max_{t \le T} s_t .\] Ok, so let’s look at the algorithm.
We now consider the AdaFTRL algorithm as introduced by (Orabona and Pál 2016). The “FTRL” part indicates that this algorithm is of the follow the regularized leader template. The regularisation is given by a bounded, strictly convex function \(R : \mathcal U\to \mathbb R\). We will assume that \(\min_{{\boldsymbol u}\in \mathcal U} R({\boldsymbol u}) = 0\), which may be achieved by adding a constant, and we will use the shorthand \(D = \max_{{\boldsymbol u}\in \mathcal U}~ R({\boldsymbol u})\). The “Ada” part indicates that AdaFTRL adaptively scales the regularisation by a sequence of positive learning rates \(\eta_1 \ge \eta_2 \ge \ldots\) In round \(t\), AdaFTRL plays \begin{equation}\label{eq:adaftrl.weight} {\boldsymbol w}_t ~=~ \operatorname*{arg\,min}_{{\boldsymbol u}\in \mathcal U}~ \left\langle{\boldsymbol u},{\boldsymbol L}_{t-1}\right\rangle + \frac{R({\boldsymbol u})}{\eta_t} \end{equation} where \({\boldsymbol L}_T = \sum_{t=1}^T {\boldsymbol \ell}_t\) is the total loss vector after \(T\) rounds. In the first round AdaFTRL will use \(\eta_1 = \infty\), and play \({\boldsymbol w}_1 = \operatorname*{arg\,min}_{{\boldsymbol u}\in \mathcal U} R({\boldsymbol u})\), the limit of \(\eta_1 \to \infty\) in \(\eqref{eq:adaftrl.weight}\).
The central concept for the choice of the learning rates and for the regret analysis will be the cumulative mix loss at learning rate \(\eta\) after \(T\) rounds, defined by \begin{equation}\label{ex:cummixloss} M_T^\eta ~=~ \min_{{\boldsymbol u}\in \mathcal U}~ \left\langle{\boldsymbol u},{\boldsymbol L}_T\right\rangle + \frac{R({\boldsymbol u})}{\eta} . \end{equation} We see that the weights \(\eqref{eq:adaftrl.weight}\) minimize \(M_{t-1}^{\eta_t}\). We then define the instantaneous and cumulative linear loss, mix loss and mixability gap by \begin{align*} h_t &~=~ \langle{\boldsymbol w}_t,{\boldsymbol \ell}_t\rangle, & m_t &~=~ M_t^{\eta_t} - M_{t-1}^{\eta_t}, & \delta_t &~=~ h_t - m_t, \\ H_T &~=~ \sum_{t=1}^T h_t, & M_T &~=~ \sum_{t=1}^T m_t, & \Delta_T &~=~ \sum_{t=1}^T \delta_t. \end{align*} Finally, AdaFTRL chooses the sequences of learning rates \begin{equation}\label{eq:adaftrl.eta} \eta_t ~=~ \frac{D}{\Delta_{t-1}}, \end{equation} so indeed \(\eta_1 = \infty\) since \(\Delta_0 = 0\). (One may avoid most infinities by reparametrising by the reciprocal \(\eta_t^{-1}\) throughout, as is done by (Orabona and Pál 2016). The current form of \(\eqref{eq:adaftrl.weight}\), \(\eqref{ex:cummixloss}\) and \(\eqref{eq:adaftrl.eta}\) is the rendering closest to (De Rooij et al. 2014).)
In the analysis given in this post I try to stay as close as possible to that in (De Rooij et al. 2014). To show that the learning rates \(\eqref{eq:adaftrl.eta}\) are indeed decreasing, we show that the mixability gap is positive each round.
Proposition 1. \(\delta_t \ge 0\).
Proof Since \({\boldsymbol w}_t\) is the minimiser of \(M_{t-1}^{\eta_t}\), we have \[m_t ~=~ M_t^{\eta_t} - M_{t-1}^{\eta_t} ~\le~ \left( \left\langle{\boldsymbol w}_t,{\boldsymbol L}_T\right\rangle + \frac{R({\boldsymbol w}_t)}{\eta_t} \right) - \left( \left\langle{\boldsymbol w}_t,{\boldsymbol L}_{t-1}\right\rangle + \frac{R({\boldsymbol w}_t)}{\eta_t} \right) ~=~ \left\langle{\boldsymbol w}_t,{\boldsymbol \ell}_t\right\rangle ~=~ h_t .\] The claim then follows by the definition of \(\delta_t = h_t - m_t\). \(\Box\)
We will additionally need an upper bound on the mixability gap \(\delta_t\) in terms of the effective loss range \(\eqref{eq:range}\).
Proposition 2. \(\delta_t \le s_t\).
Proof Clearly \(h_t \le \max_{{\boldsymbol u}\in \mathcal U} \langle{\boldsymbol u},{\boldsymbol \ell}_t\rangle\). Also \[m_t ~=~ M_t^{\eta_t} - M_{t-1}^{\eta_t} ~=~ \left( \min_{{\boldsymbol u}\in \mathcal U}~ \left\langle{\boldsymbol u},{\boldsymbol L}_t\right\rangle + \frac{R({\boldsymbol u})}{\eta_t} \right) - \left( \min_{{\boldsymbol u}\in \mathcal U}~ \left\langle{\boldsymbol u},{\boldsymbol L}_{t-1}\right\rangle + \frac{R({\boldsymbol u})}{\eta_t} \right) ~\ge~ \min_{{\boldsymbol u}\in \mathcal U}~ \left\langle{\boldsymbol u},{\boldsymbol \ell}_t\right\rangle .\] It follows that \(\delta_t \le \max_{{\boldsymbol u}\in \mathcal U}~ \langle{\boldsymbol u},{\boldsymbol \ell}_t\rangle - \min_{{\boldsymbol u}\in \mathcal U}~ \left\langle{\boldsymbol u},{\boldsymbol \ell}_t\right\rangle = s_t\). \(\Box\)
We then show that the mix loss telescopes, in the sense that it is bounded from above by the mix loss using the last (and lowest) learning rate throughout.
Proposition 3. For all \(T \ge 0\), \(M_T \le M_T^{\eta_{T+1}}\)
Proof By induction. For \(T=0\), using \(\min_{{\boldsymbol u}\in \mathcal U} R({\boldsymbol u}) = 0\), we find \[M_0 ~=~ 0 ~=~ - \frac{R({\boldsymbol w}_1)}{\eta_1} ~=~ M_0^{\eta_1} .\] For \(T \ge 1\), we use \[M_T ~=~ M_{T-1} + M_T^{\eta_T} - M_{T-1}^{\eta_T} ~\le~ M_{T-1}^{\eta_T} + M_T^{\eta_T} - M_{T-1}^{\eta_T} ~\le~ M_T^{\eta_{T+1}} ,\] where the first inequality is the induction hypothesis and the second one uses that \(M_T^\eta\) is decreasing in \(\eta\) (as is clear from \(\eqref{ex:cummixloss}\)) and that \(\eta_{T} \ge \eta_{T+1}\) (by Proposition 1). \(\Box\)
This shows that the mix loss conveniently telescopes. Why is that useful? Well, by definition \(\eqref{ex:cummixloss}\), the cumulative mix loss is almost as small as the loss of any fixed point (in particular the best point in hindsight).
Proposition 4. For any \({\boldsymbol u}\in \mathcal U\), \(M_T^{\eta} \le \left\langle{\boldsymbol u},{\boldsymbol L}_T\right\rangle + \frac{R({\boldsymbol u})}{\eta}\).
With our results so far we can prove the following intermediate regret bound.
Proposition 5. \(H_T - L_T^* \le 2 \Delta_T\).
Proof By Propositions 3 and 4, \[H_T - L_T^* ~=~ M_T - L_T^* + \Delta_T ~\le~ M_T^{\eta_{T+1}} - L_T^* + \Delta_T ~\le~ \frac{R({\boldsymbol u}^*)}{\eta_{T+1}} + \Delta_T ~\le~ \frac{D}{\eta_{T+1}} + \Delta_T ~=~ 2 \Delta_T .\] \(\Box\)
We interpret this bound as saying that AdaFTRL manages to approximately equalise the two contributions to the regret, namely the mix loss regret \(M_T - L_T^*\) and the mixability gap \(\Delta_T\).
So the goal is now to bound \(\Delta_T\). Expecting the regret expression to be a square root, the next step is to bound \(\Delta_T^2\).
Proposition 6. \(\Delta_T^2 \le S_T \Delta_T + 2 D \sum_{t=1}^T \frac{\delta_t}{\eta_t}\).
Proof Well \[\Delta_T^2 ~=~ \sum_{t=1}^T \left(\Delta_t^2 - \Delta_{t-1}^2\right) ~=~ \sum_{t=1}^T \left(\delta_t^2 + 2 \delta_t \Delta_{t-1}\right) ~=~ \sum_{t=1}^T \left(\delta_t^2 + 2 D \frac{\delta_t}{\eta_t}\right) ~\le~ S_T \Delta_T + 2 D \sum_{t=1}^T \frac{\delta_t}{\eta_t} ,\] since \(\delta_t \le s_t\) by Proposition 2 and \(s_t \le S_T\) by definition. \(\Box\)
This reduces the goal to bounding \(\frac{\delta_t}{\eta}\) each round. Once we manage to do that, we will use the following elementary result to get a regret bound.
Proposition 7. Let \(a,b \ge 0\). Then \[\max \left\{2 \Delta\middle|\Delta^2 \le a \Delta + b\right\} ~=~ \sqrt{a^2+4 b}+a ~\le~ 2 \sqrt{b} + 2 a .\]
Combining Propositions 5, 6 and 7 results in the regret bound \begin{equation}\label{eq:generic.regret} H_T - L_T^* ~\le~ \sqrt{8 D \sum_{t=1}^T \frac{\delta_t}{\eta_t}} + 2 S_T . \end{equation}
Next we need to find a tight bound on \(\frac{\delta_t}{\eta_t}\). The details of this will depend on the specifics of the domain and the regulariser. Let us look at two important cases, the Hedge setting with entropy regularisation and online linear optimisation with norm-based regularisation.
Consider a setting with \(K\) experts. Here we take \(\mathcal U= \triangle_K\) the probability simplex and \(R({\boldsymbol u}) = \operatorname{KL}({\boldsymbol u}\|1/K)\) the Kullback-Leibler divergence from the uniform distribution. With this substitution AdaFTRL becomes the AdaHedge algorithm of (De Rooij et al. 2014). To get a regret bound we use Hoeffding’s Inequality (Cesa-Bianchi and Lugosi 2006 Lemma A.1.1) \[\frac{\delta_t}{\eta_t} ~\le~ s_t^2/8,\] and as \(D = \ln K\), by \(\eqref{eq:generic.regret}\), we obtain the final regret bound \[H_T - L^*_T ~\le~ 2 \Delta_T ~\le~ \sqrt{\ln K \sum_{t=1}^T s_t^2} + 2 S_T .\] Even though we do not treat them here, we remark that (De Rooij et al. 2014) give much fancier bounds of second-order form, which imply constant regret in many stochastic scenarios. Such bounds are based on Bernstein instead of Hoeffding. See also this post for a taxonomy of such bounds.
So far so good. The next application will be online linear optimisation. But before we go there, we first take a small intermezzo to deal with an issue that is not visible in the simplex case (because the entropy regulariser is of Legendre type).
To find an upper bound on \(\delta_t = h_t - m_t\), we will use the following quantity to obtain a lower bound on \(m_t\). From this point on we will assume that the regulariser \(R\) is continuously differentiable. Let \(\triangle_R({\boldsymbol u}\|{\boldsymbol w}) = R({\boldsymbol u}) - R({\boldsymbol w}) - \left\langle{\boldsymbol u}-{\boldsymbol w}, \nabla R({\boldsymbol w})\right\rangle\) be the Bregman divergence generated by \(R\), and define the incremental mix loss by \[y_t^\eta ~=~ \min_{{\boldsymbol u}\in \mathcal U}~ \left\langle{\boldsymbol u},{\boldsymbol \ell}_t\right\rangle + \frac{\triangle_R({\boldsymbol u}\|{\boldsymbol w}_t)}{\eta} .\] Note that the definition of the incremental mix loss \(y_t^\eta\) only involves a single optimisation problem, while that of \(m_t = M_t^{\eta_t} - M_{t-1}^{\eta_t}\) involves the difference of two. Here is how we use it.
Proposition 8. \(y_t^{\eta_t} \le m_t\).
Proof Recall that \({\boldsymbol w}_t\) minimises the differentiable convex objective in \(M_{t-1}^{\eta_t}\) and hence satisfies (see (Boyd and Vandenberghe 2004 Eq. (4.21))) \[\left\langle{\boldsymbol u}-{\boldsymbol w}_t,{\boldsymbol L}_{t-1} + \frac{\nabla R({\boldsymbol w}_t)}{\eta} \right\rangle ~\ge~ 0 \qquad \text{for all ${\boldsymbol u}\in \mathcal U$}.\] Expanding definitions and applying this inequality results in \begin{align*} y_t^{\eta_t} + M_{t-1}^{\eta_t} &~=~ \min_{{\boldsymbol u}\in \mathcal U}~ \left(\left\langle{\boldsymbol u},{\boldsymbol \ell}_t\right\rangle + \frac{\triangle_R ({\boldsymbol u}\|{\boldsymbol w}_t)}{\eta}\right) + \left\langle{\boldsymbol w}_t,{\boldsymbol L}_{t-1}\right\rangle + \frac{R({\boldsymbol w}_t)}{\eta} \\ &~=~ \min_{{\boldsymbol u}\in \mathcal U}~ \left\langle{\boldsymbol u},{\boldsymbol L}_t\right\rangle + \frac{R({\boldsymbol u})}{\eta} - \left\langle{\boldsymbol u}-{\boldsymbol w}_t,{\boldsymbol L}_{t-1} + \frac{\nabla R({\boldsymbol w}_t)}{\eta}\right\rangle \\ &~\le~ \min_{{\boldsymbol u}\in \mathcal U}~ \left\langle{\boldsymbol u},{\boldsymbol L}_t\right\rangle + \frac{R({\boldsymbol u})}{\eta} ~=~ M_t^{\eta_t} \end{align*} as desired. \(\Box\)
Using Proposition 8, we conclude that \begin{equation}\label{eq:delta.and.y} \delta_t ~=~ h_t - m_t ~\le~ h_t - y_t^{\eta_t} . \end{equation}
The prototypical norm-based regulariser \(R({\boldsymbol u}) = \frac{1}{2} \|{\boldsymbol u}\|_2^2\) is the squared Euclidean norm. More generally we can work with any regulariser \(R({\boldsymbol u})\) that is strongly convex with respect to some norm \(\|\cdot\|\), i.e. \[\triangle_R\left({\boldsymbol u}\middle\|{\boldsymbol w}\right) ~\ge~ \frac{1}{2} \|{\boldsymbol u}-{\boldsymbol w}\|^2 \qquad \text{for all ${\boldsymbol u},{\boldsymbol w}\in \mathcal U$} .\] The bounds will be phrased in terms of the dual norm \(\left\|\cdot\right\|_*\), which is defined by \[\left\|{\boldsymbol u}\right\|_* ~=~ \max \big\{ \left\langle{\boldsymbol w},{\boldsymbol u}\right\rangle\big|\left\|{\boldsymbol w}\right\| = 1\big\} .\] To obtain a regret bound, we use \begin{align*} h_t - y_t^{\eta_t} &~=~ \left\langle{\boldsymbol w}_t,{\boldsymbol \ell}_t\right\rangle - \min_{{\boldsymbol u}\in \mathcal U}~ \left(\left\langle{\boldsymbol u},{\boldsymbol \ell}_t\right\rangle + \frac{\triangle_R({\boldsymbol u}\|{\boldsymbol w}_t)}{\eta_t} \right) \\ &~\le~ \max_{{\boldsymbol u}\in \mathcal U}~ \left\langle{\boldsymbol w}_t -{\boldsymbol u},{\boldsymbol \ell}_t\right\rangle - \frac{\|{\boldsymbol u}-{\boldsymbol w}_t\|^2}{2 \eta_t} \\ &~\le~ \max_{{\boldsymbol x}\in \mathbb R^d}~ \left\langle{\boldsymbol x},{\boldsymbol \ell}_t\right\rangle - \frac{\|{\boldsymbol x}\|^2}{2 \eta_t} \\ &~=~ \max_{\lambda \ge 0} \max_{{\boldsymbol x}\in \mathbb R^d : \|{\boldsymbol x}\| = 1}~ \lambda \left\langle{\boldsymbol x},{\boldsymbol \ell}_t\right\rangle - \frac{\lambda^2}{2 \eta_t} \\ &~=~ \max_{\lambda \ge 0}~ \lambda \left\|{\boldsymbol \ell}_t\right\|_* - \frac{\lambda^2}{2 \eta_t} \\ &~=~ \frac{\eta_t \left\|{\boldsymbol \ell}_t\right\|_*^2}{2} \end{align*} It follows by \(\eqref{eq:delta.and.y}\) that \[\frac{\delta_t}{\eta_t} ~\le~ \frac{\left\|{\boldsymbol \ell}_t\right\|_*^2}{2},\] and hence by \(\eqref{eq:generic.regret}\) that the regret is at most \[H_T - L_T^* ~\le~ 2 \sqrt{D \sum_{t=1}^T \left\|{\boldsymbol \ell}_t\right\|_*^2} + 2 S_T .\] It appears that this regret bound improves the leading constant of \(5.2\) in (Orabona and Pál 2016 Corollary 1) to \(2\) at the cost of an additive lower-order term \(2 S_T\).
AdaFTRL with the Euclidean norm \(\|\cdot\|_2\) is a version of AdaGrad by (Duchi, Hazan, and Singer 2011) that maintains a single learning rate. (Duchi, Hazan, and Singer 2011) develop more sophisticated algorithms that maintain either one learning rate per dimension (diagonal AdaGrad) or per direction (full AdaGrad).
See (Orabona and Pál 2016) for more discussion about the relationship of the algorithms, a historic perspective and a generalisation to unbounded domains \(\mathcal U\).
Boyd, Stephen, and Lieven Vandenberghe. 2004. Convex Optimization. Cambridge University Press.
Cesa-Bianchi, Nicolò, and Gábor Lugosi. 2006. Prediction, Learning, and Games. Cambridge University Press.
Duchi, John, Elad Hazan, and Yoram Singer. 2011. “Adaptive Subgradient Methods for Online Learning and Stochastic Optimization.” Journal of Machine Learning Research 12: 2121–59. http://www.jmlr.org/papers/volume12/duchi11a/duchi11a.pdf.
Orabona, Francesco, and Dávid Pál. 2016. “Scale-Free Online Learning.” CoRR abs/1601.01974. http://arxiv.org/abs/1601.01974.
Rooij, Steven de, Tim van Erven, Peter D. Grünwald, and Wouter M. Koolen. 2014. “Follow the Leader If You Can, Hedge If You Must.” Journal of Machine Learning Research 15 (April): 1281–1316. http://jmlr.org/papers/volume15/rooij14a/rooij14a.pdf.