2021-11-07

# Introduction

In this post I describe and solve a problem that came up in a discussion with Shubhada Agrawal and Sandeep Juneja, both of TIFR Mumbai. We are interested in proving worst-case regret bounds. Traditional worst-case regret results bound the performance of the learner in terms of some baseline performance plus an overhead. The idea here is instead to compare to a scaled-down baseline. Ideally, this makes the problem easier, and hence the overhead smaller.

## Setting

We are looking at the $$K$$ experts mix-gains setting with positive gains. More precisely, in every round $$t=1,2,\ldots$$ the learner produces a distribution $${\boldsymbol w}_t \in \triangle_K$$, the adversary produces a gains vector $${\boldsymbol g}_t \in \mathbb R_+^K$$, and the learner receives the mix gain $$\ln \left(\sum_{k=1}^K w_t^k e^{g_t^k}\right)$$. After $$T$$ rounds, the regret w.r.t. the best expert is the difference between the maximum cumulative gain among the experts and the learner’s cumulative gain: $R_T ~=~ \max_k \sum_{t=1}^T g_t^k - \sum_{t=1}^T \ln \left(\sum_{k=1}^K w_t^k e^{g_t^k}\right) .$ It is well known (and will also follow from the analysis below) that the worst-case regret is $$\ln K$$ for any horizon $$T \ge 1$$.
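As a small sanity check (not part of the original argument), here is a minimal Python sketch of this protocol in which the Bayes mixture / exponential weights strategy $$w_t^k \propto e^{G_{t-1}^k}$$ attains the classical $$\ln K$$ bound by telescoping; the function names are mine.

```python
import math
import random

def mix_gain(w, g):
    """Learner's mix gain ln(sum_k w_k * exp(g_k)) for one round."""
    return math.log(sum(wk * math.exp(gk) for wk, gk in zip(w, g)))

def exp_weights_regret(gains):
    """Play w_t^k proportional to exp(G_{t-1}^k); return the regret R_T."""
    K = len(gains[0])
    G = [0.0] * K            # cumulative expert gains
    learner = 0.0            # cumulative learner mix gain
    for g in gains:
        Z = sum(math.exp(Gk) for Gk in G)
        w = [math.exp(Gk) / Z for Gk in G]
        learner += mix_gain(w, g)
        G = [Gk + gk for Gk, gk in zip(G, g)]
    return max(G) - learner

random.seed(0)
K, T = 5, 50
gains = [[random.uniform(0.0, 2.0) for _ in range(K)] for _ in range(T)]
# The telescoping argument guarantees regret at most ln K at any horizon.
assert exp_weights_regret(gains) <= math.log(K) + 1e-9
```

The telescoping identity behind the assertion is $$\sum_t \ln\left(\sum_k w_t^k e^{g_t^k}\right) = \ln \sum_k e^{G_T^k} - \ln K \ge \max_k G_T^k - \ln K$$.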

In this post we are looking at an easier problem, obtained by dampening the gain of the best expert by some factor $$\lambda \in [0,1]$$. That is, we consider the scaled-comparator regret $R_T^\lambda ~=~ \lambda \max_k \sum_{t=1}^T g_t^k - \sum_{t=1}^T \ln \left(\sum_{k=1}^K w_t^k e^{g_t^k}\right) .$ And in particular we are interested in the minimax regret, i.e. $V_T^\lambda ~=~ \Big\langle\min_{{\boldsymbol w}_t \in \triangle_K} \sup_{{\boldsymbol g}_t \in \mathbb R_+^K}\Big\rangle_{t=1}^T~ R_T^\lambda .$

The main result of this post is

Proposition 1. The minimax regret of the $$K\ge 1$$ experts, $$T \ge 1$$ round scaled-comparator regret problem is $V_T^\lambda ~=~ \operatorname{kl}(\lambda, \lambda \wedge 1/K) ,$ where $$\operatorname{kl}(x,y) = x \ln \frac{x}{y} + (1-x) \ln \frac{1-x}{1-y}$$ is the Bernoulli relative entropy from $$x$$ to $$y$$, and $$x \wedge y = \min\left\{x,y\right\}$$ is the binary minimum.

A few remarks are in order.

• For $$\lambda=1$$ we recover the classical $$V_T^1 = \ln K$$ result, which holds for the exponential weights strategy. As we will see below, we will need a different strategy for $$\lambda <1$$.

• For $$\lambda \le 1/K$$, we find $$V_T^\lambda = 0$$. At first sight this seems counterintuitive. But on second thought, by Jensen’s inequality $\ln \left(\sum_{k=1}^K w_t^k e^{g_t^k}\right) ~\ge~ \sum_{k=1}^K w_t^k g_t^k ,$ so that by always playing uniform weights $$w_t^k = 1/K$$ the learner ensures gain at least the average expert gain, which (as gains are nonnegative) is at least the gain of the best expert divided by $$K$$. Hence the scaled-comparator regret is at most zero. On the other hand, the adversary can hand out no gain $${\boldsymbol g}_t = {\boldsymbol 0}$$ and ensure the regret is zero. All in all, indeed, $$V_T^\lambda = 0$$ for $$\lambda \le 1/K$$.

• As we will see, the strategy for the learner is independent of the horizon $$T$$. The learner can hence ensure that $$R_T^\lambda \le \operatorname{kl}(\lambda, \lambda \wedge 1/K)$$ at any time $$T = 1, 2, \ldots$$.
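The value in Proposition 1 is easy to probe numerically at the extremes discussed above. A minimal sketch (the function names are mine, not from the post):

```python
import math

def kl(x, y):
    """Bernoulli relative entropy kl(x, y), with the 0*ln(0) = 0 convention."""
    def term(a, b):
        return 0.0 if a == 0.0 else a * math.log(a / b)
    return term(x, y) + term(1.0 - x, 1.0 - y)

def minimax_regret(lam, K):
    """The value from Proposition 1: kl(lambda, lambda ∧ 1/K)."""
    return kl(lam, min(lam, 1.0 / K))

K = 4
# lambda = 1 recovers the classical bound: kl(1, 1/K) = ln K.
assert abs(minimax_regret(1.0, K) - math.log(K)) < 1e-12
# lambda <= 1/K makes the game free: the value is zero.
assert minimax_regret(0.2, K) == 0.0
# In between, the value lies strictly between 0 and ln K.
assert 0.0 < minimax_regret(0.5, K) < math.log(K)
```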

# Proof of Proposition 1

Recall that we are after the $$T$$-round minimax scaled-comparator regret $V_T^\lambda ~:=~ \left\langle\min_{{\boldsymbol w}_t \in \triangle_K} \sup_{{\boldsymbol g}_t \in \mathbb R_+^K}\right\rangle_{t=1}^T \left\{ \lambda \max_k \sum_{t=1}^T g_t^k - \sum_{t=1}^T \ln \left(\sum_k w_t^k e^{g_t^k}\right) \right\} .$ Let us introduce the following shorthand. After having played $$t<T$$ rounds with cumulative expert gains $${\boldsymbol G}_t = \sum_{s=1}^t {\boldsymbol g}_s$$, let us define the value-to-go of the remaining $$T-t \ge 1$$ rounds as \begin{align} \label{eq:vtg} V_{t,T}^\lambda({\boldsymbol G}_t) ~=~ \left\langle\min_{{\boldsymbol w}_s \in \triangle_K} \sup_{{\boldsymbol g}_s \in \mathbb R_+^K}\right\rangle_{s=t+1}^T \left\{ \lambda \max_k \left(G_t^k + \sum_{s=t+1}^T g_s^k\right) - \sum_{s=t+1}^T \ln \left(\sum_k w_s^k e^{g_s^k}\right) \right\} .\end{align} We will show the following.

Proposition 2. The value-to-go $$\eqref{eq:vtg}$$ satisfies $$V_{t,T}^\lambda({\boldsymbol G}_t) = V^\lambda({\boldsymbol G}_t)$$, where \begin{align} \label{eq:primal} V^\lambda({\boldsymbol G}) ~=~ \min_{{\boldsymbol w}\in \triangle_K} \max_k \left\{ \lambda G^k + \operatorname{kl}(\lambda, \lambda \wedge w^k) \right\} . \end{align}

Proposition 2 is proved in the next two subsections. Proposition 1 follows from it by considering $$t=0$$, where $${\boldsymbol G}_0 = {\boldsymbol 0}$$. Indeed, by symmetry the minimum in $$\eqref{eq:primal}$$ is then attained at the uniform weights $$w^k = 1/K$$, so that $$V^\lambda({\boldsymbol 0}) = \operatorname{kl}(\lambda, \lambda \wedge 1/K)$$ as desired.

Note how Proposition 2 implies that all dependence on the past rounds is summarised purely by $${\boldsymbol G}$$, while the number of future rounds remaining is immaterial.

We will also prove and use the alternative expressions: \begin{align} V^\lambda({\boldsymbol G}) &~=~ \label{eq:sec} \min_{{\boldsymbol w}\in \triangle_K} \max_k \sup_{{\boldsymbol g}\in \mathbb R_+^K} \left\{ \lambda (G^k + g^k) - \ln \left(\sum_j w^j e^{g^j}\right) \right\} \\ \label{eq:entropy} &~=~ \max_{{\boldsymbol q}\in \triangle_K}~ \lambda \sum_k q_k G^k + \min_{{\boldsymbol w}\in \triangle_K} \sum_k q_k \operatorname{kl}(\lambda, \lambda \wedge w^k) \\ \label{eq:ih} &~=~ \min_{{\boldsymbol w}\in \triangle_K} \sup_{{\boldsymbol g}\in \mathbb R_+^K}\left\{ V^\lambda({\boldsymbol G}+{\boldsymbol g}) - \ln \left(\sum_k w^k e^{g^k}\right) \right\}\end{align}

The expressions $$\eqref{eq:sec}$$ and $$\eqref{eq:ih}$$ will be proved in the base case and induction step respectively. The expression $$\eqref{eq:entropy}$$ is provided here purely for insight. It is obtained from $$\eqref{eq:primal}$$ by convexifying the inner problem $$\max_k \cdot$$ to $$\max_{{\boldsymbol q}\in \triangle_K} \operatorname*{\mathbb E}_{k \sim {\boldsymbol q}}\left[\cdot\right]$$ and swapping $$\min_{{\boldsymbol w}}$$ and $$\max_{{\boldsymbol q}}$$. It reveals that $$V^\lambda({\boldsymbol G})$$ is convex in $${\boldsymbol G}$$, as it is a max of linear functions. Moreover, it is increasing in each entry of $${\boldsymbol G}$$, and shifts additive constants scaled by $$\lambda$$, i.e. $$V^\lambda({\boldsymbol G}+ c {\boldsymbol 1}) = V^\lambda({\boldsymbol G}) + \lambda c$$. Finally, it establishes $$V^\lambda({\boldsymbol G})$$ as a regularised maximum of $$\lambda {\boldsymbol G}$$, with the regulariser (or “entropy”) defined on the probability simplex by ${\boldsymbol q}~\mapsto~ \min_{{\boldsymbol w}\in \triangle_K} \sum_k q_k \operatorname{kl}(\lambda, \lambda \wedge w^k) .$
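For $$K=2$$ the primal expression $$\eqref{eq:primal}$$ can be evaluated by a crude grid search over $$w$$, which lets us sanity-check the constant-shift property and the Proposition 1 value numerically. A rough sketch (all names are mine; the grid resolution is arbitrary):

```python
import math

def kl(x, y):
    """Bernoulli relative entropy kl(x, y), with 0*ln(0) = 0."""
    def term(a, b):
        return 0.0 if a == 0.0 else a * math.log(a / b)
    return term(x, y) + term(1.0 - x, 1.0 - y)

def V(lam, G, grid=10001):
    """Primal form for K = 2: min over w = (w, 1-w) of
    max_k { lam * G_k + kl(lam, lam ∧ w_k) }, via a grid over w."""
    best = float("inf")
    for i in range(1, grid):
        w = i / grid
        val = max(lam * G[0] + kl(lam, min(lam, w)),
                  lam * G[1] + kl(lam, min(lam, 1.0 - w)))
        best = min(best, val)
    return best

lam, G, c = 0.7, (0.3, 1.1), 0.5
# Adding a constant c to all gains shifts the value by lambda * c.
assert abs(V(lam, (G[0] + c, G[1] + c)) - V(lam, G) - lam * c) < 1e-3
# At G = 0 the value matches Proposition 1 with K = 2.
assert abs(V(lam, (0.0, 0.0)) - kl(lam, 0.5)) < 1e-3
```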

## Proposition 2: Base case

Let’s start at the end, with the last round. By definition $$\eqref{eq:vtg}$$, $V_{T-1,T}^\lambda({\boldsymbol G}_{T-1}) ~=~ \min_{{\boldsymbol w}_T \in \triangle_K} \sup_{{\boldsymbol g}_T \in \mathbb R_+^K} \left\{ \lambda \max_k (G_{T-1}^k + g_T^k) - \ln \left(\sum_k w_T^k e^{g_T^k}\right) \right\} .$ Taking the $$\max_k$$ to the front results in $V_{T-1,T}^\lambda({\boldsymbol G}_{T-1}) ~=~ \min_{{\boldsymbol w}_T \in \triangle_K} \max_k \sup_{{\boldsymbol g}_T \in \mathbb R_+^K} \left\{ \lambda (G_{T-1}^k + g_T^k) - \ln \left(\sum_j w_T^j e^{g_T^j}\right) \right\} ,$ which, once we establish $$\eqref{eq:primal}$$ below, also proves $$\eqref{eq:sec}$$. Plugging in the worst-case choice $$g_T^j=0$$ for $$j \neq k$$ further results in $V_{T-1,T}^\lambda({\boldsymbol G}_{T-1}) ~=~ \min_{{\boldsymbol w}_T \in \triangle_K} \max_k \sup_{g_T^k \in \mathbb R_+} \left\{ \lambda (G_{T-1}^k + g_T^k) - \ln \left(w_T^k e^{g_T^k} + (1-w_T^k)\right) \right\} .$ The best choice for $$g_T^k$$ is $g_T^k ~=~ \max\left\{ 0, \ln \frac{ \lambda (1-w_T^k) }{ (1-\lambda) w_T^k } \right\} ,$ at which point the inner objective becomes $$\lambda G_{T-1}^k + \operatorname{kl}(\lambda, \lambda \wedge w_T^k)$$. We hence end up with $V_{T-1,T}^\lambda({\boldsymbol G}_{T-1}) ~=~ \min_{{\boldsymbol w}_T \in \triangle_K} \max_k \left\{ \lambda G_{T-1}^k + \operatorname{kl}(\lambda, \lambda \wedge w_T^k) \right\} ,$ which finally establishes the desired form $$\eqref{eq:primal}$$.
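The one-round optimisation above is easy to verify numerically: the claimed maximiser attains the closed form, and no other gain does better. A small sketch (variable names are mine):

```python
import math

def kl(x, y):
    """Bernoulli relative entropy kl(x, y), with 0*ln(0) = 0."""
    def term(a, b):
        return 0.0 if a == 0.0 else a * math.log(a / b)
    return term(x, y) + term(1.0 - x, 1.0 - y)

def objective(lam, G, w, g):
    """One-round inner objective lam*(G + g) - ln(w*e^g + (1 - w))."""
    return lam * (G + g) - math.log(w * math.exp(g) + (1.0 - w))

lam, G, w = 0.6, 0.8, 0.25
# The claimed maximiser over g >= 0.
g_star = max(0.0, math.log(lam * (1.0 - w) / ((1.0 - lam) * w)))
# Its value matches the closed form lam*G + kl(lam, lam ∧ w) ...
assert abs(objective(lam, G, w, g_star) - (lam * G + kl(lam, min(lam, w)))) < 1e-9
# ... and no point on a grid over g in [0, 10] does better.
assert all(objective(lam, G, w, i / 100) <= objective(lam, G, w, g_star) + 1e-9
           for i in range(1001))
```

Since the objective is concave in $$g$$ (its derivative $$\lambda - \frac{w e^g}{w e^g + 1 - w}$$ is decreasing), the grid check over a bounded range is a reasonable proxy for the sup over all of $$\mathbb R_+$$.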

## Proposition 2: Induction step

Unpacking the definition of value-to-go $$\eqref{eq:vtg}$$, we find \begin{align*} V_{t-1,T}^\lambda({\boldsymbol G}_{t-1}) &~=~ \min_{{\boldsymbol w}_t \in \triangle_K} \sup_{{\boldsymbol g}_t \in \mathbb R_+^K} \left\{ V_{t,T}^\lambda({\boldsymbol G}_{t-1} + {\boldsymbol g}_t) - \ln \left(\sum_k w_t^k e^{g_t^k}\right) \right\} . \end{align*} To show that $$V_{t-1,T}^\lambda({\boldsymbol G}_{t-1}) = V^\lambda({\boldsymbol G}_{t-1})$$, it hence remains to show the “minimax pullback” property $$\eqref{eq:ih}$$. We will prove both inequalities of $$\eqref{eq:ih}$$ separately. The case $$\ge$$ is obvious by taking $${\boldsymbol g}={\boldsymbol 0}$$. For $$\le$$, we expand $$V^\lambda$$ using $$\eqref{eq:sec}$$: \begin{align*} & \min_{{\boldsymbol w}\in \triangle_K} \sup_{{\boldsymbol g}\in \mathbb R_+^K}\left\{ V^\lambda({\boldsymbol G}+{\boldsymbol g}) - \ln \left(\sum_k w_k e^{g_k}\right) \right\} \\ &~=~ \min_{{\boldsymbol w}\in \triangle_K} \sup_{{\boldsymbol g}\in \mathbb R_+^K}\left\{ \min_{{\boldsymbol w}' \in \triangle_K} \max_k \sup_{{\boldsymbol g}' \in \mathbb R_+^K} \left\{ \lambda (G_k + g_k + g'_k) - \ln \left(\sum_j w'_j e^{g'_j}\right) \right\} - \ln \left(\sum_j w_j e^{g_j}\right) \right\} . \end{align*} Now if we take $${\boldsymbol w}'$$ to be the exponential weights update of prior $${\boldsymbol w}$$ encountering gain $${\boldsymbol g}$$, i.e. 
$$w'_k \propto w_k e^{g_k}$$, then we find \begin{align*} &~\le~ \min_{{\boldsymbol w}\in \triangle_K} \sup_{{\boldsymbol g}\in \mathbb R_+^K}\left\{ \max_k \sup_{{\boldsymbol g}' \in \mathbb R_+^K} \left\{ \lambda (G_k + g_k + g'_k) - \ln \left(\sum_j \frac{w_j e^{g_j}}{\sum_i w_i e^{g_i}} e^{g'_j}\right) \right\} - \ln \left(\sum_j w_j e^{g_j}\right) \right\} \\ &~=~ \min_{{\boldsymbol w}\in \triangle_K} \max_k \sup_{{\boldsymbol g}, {\boldsymbol g}' \in \mathbb R_+^K} \left\{ \lambda (G_k + g_k + g'_k) - \ln \left(\sum_j w_j e^{g_j + g'_j}\right) \right\} \end{align*} Finally, merging $${\boldsymbol g}+ {\boldsymbol g}'$$ into a single gain vector in $$\mathbb R_+^K$$, we recognise $$\eqref{eq:sec}$$ and conclude \begin{align*} &~=~ V^\lambda({\boldsymbol G}) , \end{align*} as desired.
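The merging step rests on an exact identity: the mix gain under the prior plus the mix gain under the exponential weights posterior equals the mix gain of the merged gain vector under the prior. A quick numerical check (names are mine):

```python
import math
import random

def log_mix(w, g):
    """ln(sum_k w_k * exp(g_k))."""
    return math.log(sum(wk * math.exp(gk) for wk, gk in zip(w, g)))

random.seed(1)
K = 4
w = [random.random() for _ in range(K)]
w = [wk / sum(w) for wk in w]                    # prior on the simplex
g = [random.uniform(0, 2) for _ in range(K)]     # first gain vector
g2 = [random.uniform(0, 2) for _ in range(K)]    # second gain vector

# Exponential weights posterior: w'_k proportional to w_k * exp(g_k).
Z = sum(wk * math.exp(gk) for wk, gk in zip(w, g))
w_post = [wk * math.exp(gk) / Z for wk, gk in zip(w, g)]

# Two mix gains (prior, then posterior) equal one mix gain on merged gains.
lhs = log_mix(w, g) + log_mix(w_post, g2)
rhs = log_mix(w, [gk + g2k for gk, g2k in zip(g, g2)])
assert abs(lhs - rhs) < 1e-12
```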

# Discussion

We have quantified how the regret game gets easier when the comparator gain is scaled by a factor $$\lambda \in [0,1]$$. Interestingly, this crucially depends on the assumption that gains are positive. If gains may also be negative, then scaling by a factor $$\lambda < 1$$ may make them larger, thus boosting the comparator gain. In turn, we find that for two-sided gains the minimax regret is infinite unless $$\lambda = 1$$. Yet another interesting combination is negative gains (which we may rephrase as positive losses) with scale factor $$\lambda \ge 1$$. Here we find that the minimax scaled-comparator regret is $$\ln K$$ for any $$\lambda \ge 1$$. Even though $$\lambda >1$$ does hamper the comparator, the game does not get any easier.