2015-08-13
We derive the relative entropy version of the second-order quantile regret bound for Squint.
I want to talk about Squint, a new prediction strategy Tim van Erven and I recently discovered (see (Koolen and Van Erven 2015)). Squint operates in the following protocol, called the Hedge setting. Prediction proceeds in rounds. In round \(t\) the learner plays a probability vector \(\boldsymbol{w}_t\) on experts \(k \in \{1, \ldots, K\}\). Then the adversary reveals the vector \(\boldsymbol{\ell}_t \in [0,1]^K\) of expert losses and the learner incurs the dot loss \(\boldsymbol{w}_t^\intercal\boldsymbol{\ell}_t\).
Let us write \(r_t^k := (\boldsymbol{w}_t - \boldsymbol{e}_k)^\intercal\boldsymbol{\ell}_t\) for the instantaneous regret w.r.t. expert \(k\) in round \(t\). Also let us abbreviate the cumulative regret to \(R_T^k := \sum_{t=1}^T r_t^k\) and the cumulative regret squared to \(V_T^k := \sum_{t=1}^T (r_t^k)^2\). The Squint strategy is based on (motivated by) the potential function \[\Phi_T ~:=~ \mathop{\mathrm{\mathbb E}}_{\pi(k)\gamma(\eta)} \bigg[ e^{\eta R_T^k - \eta^2 V_T^k} -1 \bigg] ,\] where \(\pi\) is a prior on experts \(k \in \{1, \ldots, K\}\) and \(\gamma\) is a prior on learning rates \(\eta \in [0,1/2]\). The Squint weights \[\boldsymbol{w}_t ~=~ \frac{ \mathop{\mathrm{\mathbb E}}_{\pi(k)\gamma(\eta)} \left[ \eta \boldsymbol{e}_k e^{\eta R_{t-1}^k - \eta^2 V_{t-1}^k} \right] }{ \mathop{\mathrm{\mathbb E}}_{\pi(k)\gamma(\eta)} \left[ \eta e^{\eta R_{t-1}^k - \eta^2 V_{t-1}^k} \right] }\] are chosen to make the potential decreasing \(0 = \Phi_0 \ge \Phi_1 \ge \ldots\) To see this, we write \begin{align*} \Phi_{t} - \Phi_{t-1} &~=~ \mathop{\mathrm{\mathbb E}}_{\pi(k)\gamma(\eta)} \left[ e^{\eta R_{t-1}^k - \eta^2 V_{t-1}^k} (e^{\eta r_t^k - (\eta r_t^k)^2}-1) \right] \\ &~\le~ \mathop{\mathrm{\mathbb E}}_{\pi(k)\gamma(\eta)} \left[ e^{\eta R_{t-1}^k - \eta^2 V_{t-1}^k} \eta r_t^k \right] \\ &~=~ \mathop{\mathrm{\mathbb E}}_{\pi(k)\gamma(\eta)} \left[ e^{\eta R_{t-1}^k - \eta^2 V_{t-1}^k} \eta (\boldsymbol{w}_t - \boldsymbol{e}_k) \right]^\intercal\boldsymbol{\ell}_t \\ &~=~ 0 \end{align*} where the inequality uses that \(e^{x-x^2} \le 1 + x\) for \(x \ge -1/2\) and the last equality is due to the choice of weights \(\boldsymbol{w}_t\).
In the paper (Koolen and Van Erven 2015) we show that a negative potential implies quantile regret bounds. We show that for any subset \(\mathcal K\) of experts, with abbreviations \(R_T^\mathcal K= \mathop{\mathrm{\mathbb E}}_{\pi(k|\mathcal K)}[R_T^k]\) and \(V_T^\mathcal K= \mathop{\mathrm{\mathbb E}}_{\pi(k|\mathcal K)}[V_T^k]\), we have \begin{equation}\label{eq:orig} R_T^\mathcal K~\preceq~ \sqrt{V_T^\mathcal K\left(- \ln \pi(\mathcal K) + \ln \ln T\right)} . \end{equation}
Now what happens if we do not want to compare to distributions of the form \(\pi(k|\mathcal K)\) but to a general distribution \(\rho(k)\)? Intuitively, using abbreviations \(R_T^\rho = \mathop{\mathrm{\mathbb E}}_{\rho(k)} [R_T^k]\) and \(V_T^\rho = \mathop{\mathrm{\mathbb E}}_{\rho(k)} [ V_T^k]\), we should be able to prove a relative entropy regret bound \begin{equation}\label{eq:newbd} R_T^\rho ~\preceq~ \sqrt{V_T^\rho \left(\operatorname{KL}\left(\rho\middle\|\pi\right) + \ln \ln T\right)} . \end{equation} Now \(\eqref{eq:newbd}\) is a proper generalisation of \(\eqref{eq:orig}\), as applying it to \(\rho(k) = \pi(k|\mathcal K)\) gives \[\operatorname{KL}\left(\rho\middle\|\pi\right) ~=~ \mathop{\mathrm{\mathbb E}}_{\pi(k|\mathcal K)} \left[\ln \frac{\pi(k|\mathcal K)}{\pi(k)}\right] ~=~ - \ln \pi(\mathcal K)\] When we wrote the paper, we restricted to bounds of the form \(\eqref{eq:orig}\) for simplicity. In related work, (Luo and Schapire 2015) give relative entropy bounds for a similar algorithm. Now several people have asked us whether bounds \(\eqref{eq:newbd}\) actually hold for Squint. They do hold, for unmodified Squint, without any additional effort. In this post I will show how to derive them.
Relative entropy bounds are typically based on the following variational inequality. For distributions \(\pi(k)\) and \(\rho(k)\) and random variable \(X_k\), \begin{equation}\label{eq:PAC} \mathop{\mathrm{\mathbb E}}_{\pi(k)} \left[e^{X_k}\right] ~\ge~ \mathop{\mathrm{\mathbb E}}_{\rho(k)} \left[ \frac{\pi(k)}{\rho(k)} e^{X_k}\right] ~=~ \mathop{\mathrm{\mathbb E}}_{\rho(k)} \left[ e^{\ln \frac{\pi(k)}{\rho(k)} + X_k}\right] %\\ ~\ge~ e^{\mathop{\mathrm{\mathbb E}}_{\rho(k)} \left[ \ln \frac{\pi(k)}{\rho(k)} + X_k\right]} ~=~ e^{-\operatorname{KL}\left(\rho\middle\|\pi\right) + \mathop{\mathrm{\mathbb E}}_{\rho(k)} \left[X_k\right]} . \end{equation} The first inequality is strict if \(\rho(k)\) has smaller support than \(\pi(k)\). The second is Jensen’s Inequality.
We first do the analysis for proper priors, as it is easier. Let \(\gamma\) be a proper prior. Then \(\Phi_T \le 0\) implies \[\mathop{\mathrm{\mathbb E}}_{\pi(k)\gamma(\eta)} \bigg[ e^{\eta R_T^k - \eta^2 V_T^k} \bigg] ~\le~ 1 .\] Now take \(\rho(k)\) any comparator distribution on experts. Then by \(\eqref{eq:PAC}\), \[\mathop{\mathrm{\mathbb E}}_{\pi(k)\gamma(\eta)} \bigg[ e^{\eta R_T^k - \eta^2 V_T^k} \bigg] ~\ge~ e^{ - \operatorname{KL}\left(\rho\middle\|\pi\right) } \mathop{\mathrm{\mathbb E}}_{\gamma(\eta)} \left[ e^{ \eta R_T^\rho - \eta^2 V_T^\rho } \right]\] where we abbreviated \(R_T^\rho = \mathop{\mathrm{\mathbb E}}_{\rho(k)} [R_T^k]\) and \(V_T^\rho = \mathop{\mathrm{\mathbb E}}_{\rho(k)} [V_T^k]\). We therefore find \[e^{ - \operatorname{KL}\left(\rho\middle\|\pi\right) } \mathop{\mathrm{\mathbb E}}_{\gamma(\eta)} \left[ e^{ \eta R_T^\rho - \eta^2 V_T^\rho } \right] ~\le~ 1\] from which it follows that for any \(\eta\) with mass \(\gamma(\eta)\) we have \[\eta R_T^\rho - \eta^2 V_T^\rho ~\le~ \operatorname{KL}\left(\rho\middle\|\pi\right) - \ln \gamma(\eta)\] Plugging in \(\hat \eta = \frac{R_T^\rho}{2 V_T^\rho}\), we obtain \[R_T^\rho ~\le~ 2 \sqrt{ V_T^\rho \left( \operatorname{KL}\left(\rho\middle\|\pi\right) - \ln \gamma(\hat\eta) \right) } .\] In the case that \(\gamma\) puts insufficient mass on \(\hat \eta\) e.g. because it is a density, we need to bound the amount of mass close enough to \(\hat \eta\). As this task is orthogonal to the question of relative entropy bounds, I refer back to our paper (Koolen and Van Erven 2015) where we show how to do this for two appropriate families of proper priors.
Let \(\gamma\) be the improper prior with density \(\gamma(\textrm d\eta) = \frac{1}{\eta} \textrm d\eta\). The motivation for this prior is that it gives good regret bounds while at the same time admitting closed-form integration of the weights, resulting in constant run time per expert per round. We start out with negative potential: \[0 ~\ge~ \Phi_T ~=~ \mathop{\mathrm{\mathbb E}}_{\pi(k)} \bigg[ \int_0^{1/2} \frac{ e^{\eta R_T^k - \eta^2 V_T^k} - 1 }{ \eta } \textrm d\eta \bigg] .\] For any \(\epsilon \in [0,1/2]\), we can split the integral \[\Phi_T ~=~ \mathop{\mathrm{\mathbb E}}_{\pi(k)} \bigg[ \int_0^{\epsilon} \frac{ e^{\eta R_T^k - \eta^2 V_T^k} - 1 }{ \eta } \textrm d\eta \bigg] + \mathop{\mathrm{\mathbb E}}_{\pi(k)} \bigg[ \int_\epsilon^{1/2} \frac{ e^{\eta R_T^k - \eta^2 V_T^k} - 1 }{ \eta } \textrm d\eta \bigg]\] To bound the first term we use \(e^x \ge 1+x\), \(R_T^k \ge -T\) and \(V_T^k \le T\) to find \[\mathop{\mathrm{\mathbb E}}_{\pi(k)} \bigg[ \int_0^{\epsilon} \frac{ e^{\eta R_T^k - \eta^2 V_T^k} - 1 }{ \eta } \textrm d\eta \bigg] ~\ge~ \int_0^{\epsilon} - T - \eta T \,\textrm d\eta ~=~ - \left( \epsilon + \frac{\epsilon^2}{2}\right) T .\] To bound the second term, we use \(\eqref{eq:PAC}\) to obtain \begin{align*} \mathop{\mathrm{\mathbb E}}_{\pi(k)} \bigg[ \int_\epsilon^{1/2} \frac{ e^{\eta R_T^k - \eta^2 V_T^k} - 1 }{ \eta } \textrm d\eta \bigg] &~=~ \mathop{\mathrm{\mathbb E}}_{\pi(k)} \bigg[ \int_\epsilon^{1/2} \frac{ e^{\eta R_T^k - \eta^2 V_T^k} }{ \eta } \textrm d\eta \bigg] - \mathop{\mathrm{\mathbb E}}_{\pi(k)} \bigg[ \int_\epsilon^{1/2} \frac{ 1 }{ \eta } \textrm d\eta \bigg] \\ &~\ge~ e^{- \operatorname{KL}(\rho\|\pi)} \int_\epsilon^{1/2} \frac{ e^{\eta R_T^\rho - \eta^2 V_T^\rho} }{ \eta } \textrm d\eta + \ln (2 \epsilon) . \end{align*} Plugging in the convenient choice \(\epsilon = \frac{1}{2(T+1)}\) for which \((\epsilon + \epsilon^2/2)T \le 1/2\), we now arrive at \[e^{- \operatorname{KL}(\rho\|\pi)} \int_{\frac{1}{2(T+1)}}^{1/2} \frac{ e^{\eta R_T^\rho - \eta^2 V_T^\rho} }{ \eta } \textrm d\eta ~\le~ \frac{1}{2} + \ln (T+1) .\] To bound the integral, we need a case analysis, based on \(R = R_T^\rho\) and \(V = V_T^\rho\). There are three cases.
First the important case. If \(\hat \eta = \frac{R}{2 V}\) satisfies \(\hat \eta - \frac{1}{\sqrt{2 V}} \ge \frac{1}{2(T+1)}\) and \(\hat \eta \le 1/2\), we find \begin{align*} \int_{\frac{1}{2(T+1)}}^{1/2} \frac{ e^{\eta R - \eta^2 V} }{ \eta } \textrm d\eta &~\ge~ \int_{\hat \eta - \frac{1}{\sqrt{2 V}}}^{\hat \eta} \frac{ e^{(\hat \eta - \frac{1}{\sqrt{2 V}}) R - (\hat \eta - \frac{1}{\sqrt{2 V}})^2 V} }{ \eta } \textrm d\eta ~=~ e^{\frac{R^2}{4 V} - \frac{1}{2}} %e^{(\hat \eta - \frac{1}{\sqrt{2 V}}) R - (\hat \eta - \frac{1}{\sqrt{2 V}})^2 V} \ln \frac{1}{1 - \frac{\sqrt{2 V}}{R}} \\ &~\ge~ e^{\frac{1}{2} \left(\frac{R}{\sqrt{2 V}} - 1\right)^2} e^{\frac{R}{\sqrt{2 V}}-1} \frac{\sqrt{2 V}}{R} ~\ge~ e^{\frac{1}{2} \left(\frac{R}{\sqrt{2 V}} - 1\right)^2} \end{align*} In this case, \[e^{- \operatorname{KL}\left(\rho\middle\|\pi\right) + \frac{1}{2} \left(\frac{R}{\sqrt{2 V}} - 1\right)^2} ~\le~ \frac{1}{2} + \ln (T+1)\] from which it follows that \[R ~\le~ \sqrt{2 V} \left( 1 + \sqrt{2 \left( \operatorname{KL}\left(\rho\middle\|\pi\right) + \ln \left( \frac{1}{2} + \ln (T+1) \right) \right) } \right) .\]
Second, if \(\hat \eta - \frac{1}{\sqrt{2 V}} < \frac{1}{2(T+1)}\) we conclude \[R ~<~ \frac{V}{T+1} + \sqrt{2 V} ~<~ 1 + \sqrt{2 V} .\]
Finally, if \(\hat \eta > 1/2\) we find \(R > V\), and hence for any \(u \in [\frac{1}{2(T+1)}, 1/2]\), \begin{align*} \int_{\frac{1}{2(T+1)}}^{1/2} \frac{ e^{\eta R - \eta^2 V} }{ \eta } \textrm d\eta &~\ge~ \int_{u}^{1/2} \frac{ e^{u R - u^2 V} }{ \eta } \textrm d\eta ~=~ e^{u R - u^2 V} \ln \frac{1}{2 u} ~\ge~ e^{u (1-u)R} \ln \frac{1}{2 u} \end{align*} so that \[R ~\le~ \frac{1}{u (1-u)} \left( \operatorname{KL}(\rho\|\pi) + \ln \frac{ \frac{1}{2} + \ln (T+1) }{ - \ln (2 u) } \right) .\] Plugging in \(u = \frac{5 - \sqrt{5}}{10}\) (which is sufficiently large for all \(T \ge 1\)) we get \[R ~\le~ 5 \big( \operatorname{KL}(\rho\|\pi) + \ln \left( 1 + 2\ln (T+1) \right) \big)\]
Putting all three cases together, we find \[\begin{gathered} R_T^\rho ~\le~ \sqrt{2 V_T^\rho} \left( 1 + \sqrt{2 \left( \operatorname{KL}\left(\rho\middle\|\pi\right) + \ln \left( \frac{1}{2} + \ln (T+1) \right) \right) } \right) \\ + 1 + 5 \big( \operatorname{KL}(\rho\|\pi) + \ln \left( 1 + 2\ln (T+1) \right) \big),\end{gathered}\] which is the relative entropy upgrade of (Koolen and Van Erven 2015, Theorem 4).