2016-01-13
A collection of bounds on the mixability gap.
In this post I collect a series of bounds on the so-called mixability gap. The mixability gap is an important ingredient of the analysis (regret bound) of the Hedge algorithm (see (De Rooij et al. 2014) for more details).
Fix a number of dimensions \(K \ge 1\), a probability distribution \(\boldsymbol w\in \triangle_K\), a loss vector \(\boldsymbol \ell\in [0,1]^K\) and a learning rate \(\eta \ge 0\). The mixability gap with these ingredients is defined as the difference \[\delta ~=~ h - m^\eta\] between the linear loss \[h ~=~ \sum_k w_k \ell_k\] and the \(\eta\)-mix loss \[m^\eta ~=~ \frac{-1}{\eta} \ln \sum_k w_k e^{-\eta \ell_k} .\] What bounds can we give on \(\delta\)? The bounds and their implication relations are shown in the figure below. For clarity, I use two abbreviations: \(r_k = h - \ell_k\) and \(\psi(x) = e^x - x -1\).
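These quantities are straightforward to evaluate numerically. Here is a minimal sketch (the weights, losses and learning rate are arbitrary illustrative values):

```python
import math

def linear_loss(w, loss):
    # h = sum_k w_k * ell_k
    return sum(wk * lk for wk, lk in zip(w, loss))

def mix_loss(w, loss, eta):
    # m^eta = -(1/eta) * ln(sum_k w_k * exp(-eta * ell_k))
    return -math.log(sum(wk * math.exp(-eta * lk) for wk, lk in zip(w, loss))) / eta

def mixability_gap(w, loss, eta):
    # delta = h - m^eta
    return linear_loss(w, loss) - mix_loss(w, loss, eta)

# Illustrative example with K = 3 experts
w = [0.5, 0.3, 0.2]
loss = [0.1, 0.9, 0.4]
eta = 2.0
delta = mixability_gap(w, loss, eta)
```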
Proposition 1. \(0 \le \delta\).
Proof Applying Jensen’s inequality to the concave logarithm results in \(m^\eta \le h\) and hence \(\delta \ge 0\). \(\Box\)
Proposition 2. \(\delta \le \frac{1}{\eta} \ln \left((1-h)e^{\eta h} + h e^{\eta(h-1)}\right)\).
Proof By convexity of the exponential we have \(e^{- \eta \ell} \le (1-\ell) + \ell e^{-\eta}\) for any \(\ell\in [0,1]\). So we have \[\sum_k w_k e^{-\eta \ell_k} ~\le~ \sum_k w_k \left((1-\ell_k) + \ell_k e^{- \eta}\right) ~=~ 1-h + h e^{-\eta} ,\] and as a result \begin{equation}\label{eq:delta.after.cvx} \delta ~=~ h - m^\eta ~\le~ h + \frac{1}{\eta} \ln (1 - h + h e^{-\eta}) ~=~ \frac{1}{\eta} \ln \left((1-h) e^{\eta h} + h e^{\eta(h-1)}\right) . \end{equation} The above bound is tight for experts with binary losses in \(\{0,1\}\). \(\Box\)
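A quick numerical sanity check of this bound, including its tightness for binary losses (weights, losses and \(\eta\) chosen arbitrarily):

```python
import math

def mixability_gap(w, loss, eta):
    h = sum(wk * lk for wk, lk in zip(w, loss))
    m = -math.log(sum(wk * math.exp(-eta * lk) for wk, lk in zip(w, loss))) / eta
    return h - m

def bound(h, eta):
    # (1/eta) * ln((1-h) e^{eta h} + h e^{eta (h-1)})
    return math.log((1 - h) * math.exp(eta * h) + h * math.exp(eta * (h - 1))) / eta

eta = 1.5

# Generic losses in [0, 1]
w = [0.4, 0.35, 0.25]
loss = [0.2, 0.7, 1.0]
h = sum(wk * lk for wk, lk in zip(w, loss))
gap = mixability_gap(w, loss, eta)

# Binary losses in {0, 1}: the bound should be exact
w2 = [0.6, 0.4]
loss2 = [0.0, 1.0]
h2 = 0.4
gap2 = mixability_gap(w2, loss2, eta)
```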
Proposition 3. \(\frac{1}{\eta} \ln \left((1-h) e^{\eta h} + h e^{\eta(h-1)}\right) \le \frac{1}{e^{\eta }-1}+\frac{\ln \frac{e^{\eta }-1}{\eta}-1}{\eta}\).
Proof As can be seen from the rightmost equality of \(\eqref{eq:delta.after.cvx}\), the left-hand side is a concave function of \(h\). Its derivative is zero at \(h=1 -\left(\frac{1}{\eta }-\frac{1}{e^{\eta }-1}\right)\). Moreover, this optimiser satisfies \(h \in [\frac{1}{2}, 1]\). The value at this optimiser equals the right-hand side. \(\Box\)
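The claims in this proof can be verified numerically; a sketch, with an arbitrary choice of \(\eta\) and a coarse grid over \(h\):

```python
import math

def lhs(h, eta):
    # (1/eta) * ln((1-h) e^{eta h} + h e^{eta (h-1)})
    return math.log((1 - h) * math.exp(eta * h) + h * math.exp(eta * (h - 1))) / eta

def rhs(eta):
    # 1/(e^eta - 1) + (ln((e^eta - 1)/eta) - 1)/eta
    return 1 / math.expm1(eta) + (math.log(math.expm1(eta) / eta) - 1) / eta

eta = 0.7
h_star = 1 - (1 / eta - 1 / math.expm1(eta))  # the optimiser from the proof
worst = max(lhs(i / 100, eta) for i in range(1, 100))
```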
Proposition 4. \(\frac{1}{e^{\eta }-1}+\frac{\ln \frac{e^{\eta }-1}{\eta}-1}{\eta} \le \frac{\eta}{8}\).
Proof Let \(\operatorname{KL}_2(p\|h) = p \ln \frac{p}{h} + (1-p) \ln \frac{1-p}{1-h}\) be the binary relative entropy. By Pinsker’s Inequality \(\operatorname{KL}_2(p\|h) \ge 2 (p-h)^2\). So \begin{align} \notag \frac{1}{e^{\eta }-1}+\frac{\ln \frac{e^{\eta }-1}{\eta}-1}{\eta} &~=~ \notag \max_h~ h + \frac{1}{\eta} \ln (1-h+h e^{-\eta}) \\ &~=~ \label{eq:withKL} \max_{h,p}~ (h-p) -\frac{1}{\eta} \operatorname{KL}_2(p\|h) \\ \notag &~\le~ \max_{h,p}~ (h-p) - \frac{2}{\eta} (h-p)^2 \\ \notag &~\le~ \frac{\eta}{8}\end{align} where the second equality is the Donsker-Varadhan variational formula \(\ln(1-h+he^{-\eta}) = \max_p \left(-\eta p - \operatorname{KL}_2(p\|h)\right)\) applied to the Bernoulli distribution with bias \(h\), and the last inequality follows from plugging in the unconstrained maximiser \(|h-p|=\frac{\eta}{4}\). \(\Box\)
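Numerically, the bound \(\frac{\eta}{8}\) checks out across a range of learning rates (an arbitrary grid):

```python
import math

def lhs(eta):
    # 1/(e^eta - 1) + (ln((e^eta - 1)/eta) - 1)/eta
    return 1 / math.expm1(eta) + (math.log(math.expm1(eta) / eta) - 1) / eta

# For small eta the two sides agree to high order, hence the slack tolerance
checks = [lhs(eta) <= eta / 8 + 1e-9 for eta in [0.01, 0.1, 0.5, 1.0, 2.0, 5.0]]
```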
Propositions 2 to 4 establish that \(\delta \le \frac{\eta}{8}\). This result is called Hoeffding’s Inequality in (Cesa-Bianchi and Lugosi 2006, Lemma A.1.1).
Proposition 5. \(\frac{1}{e^{\eta }-1}+\frac{\ln \frac{e^{\eta }-1}{\eta}-1}{\eta} \le 1\).
Proof It is clear from \(\eqref{eq:withKL}\) that the left-hand side is increasing in \(\eta\). Taking the limit of \(\eta \to \infty\) results in \(1\). \(\Box\)
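Both claims in this proof (monotonicity in \(\eta\) and the limit \(1\)) can be eyeballed numerically on an arbitrary grid:

```python
import math

def lhs(eta):
    # 1/(e^eta - 1) + (ln((e^eta - 1)/eta) - 1)/eta
    return 1 / math.expm1(eta) + (math.log(math.expm1(eta) / eta) - 1) / eta

etas = [0.1, 0.5, 1.0, 2.0, 5.0, 10.0, 20.0]
vals = [lhs(eta) for eta in etas]
```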
Proposition 6. \(\frac{1}{\eta} \ln \left((1-h) e^{\eta h} + h e^{\eta(h-1)}\right) \le \frac{\psi(-\eta)}{\eta} h\).
Proof The objective is concave in \(h\). The upper bound is the tangent at \(h=0\). \(\Box\)
Propositions 2 and 6 establish \(\delta \le \frac{\psi(-\eta)}{\eta} h\), which we may reorganise into the perhaps familiar \[h ~\le~ \frac{\eta}{1 -e^{-\eta}} m^\eta .\]
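As a sanity check of the reorganised bound (example values arbitrary):

```python
import math

eta = 1.2
w = [0.5, 0.3, 0.2]
loss = [0.1, 0.9, 0.4]

# linear loss h and eta-mix loss m
h = sum(wk * lk for wk, lk in zip(w, loss))
m = -math.log(sum(wk * math.exp(-eta * lk) for wk, lk in zip(w, loss))) / eta
```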
Proposition 7. \(\frac{\psi(-\eta)}{\eta} \le 1\) and \(\frac{\psi(-\eta)}{\eta} \le \frac{\eta}{2}\).
Proof The function \(\frac{\psi(-\eta)}{\eta}\) is increasing in \(\eta\) with value \(1\) for \(\eta \to \infty\). It is also concave in \(\eta\), and hence bounded by its tangent at \(\eta=0\), which is \(\frac{\eta}{2}\). \(\Box\)
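Both bounds on \(\frac{\psi(-\eta)}{\eta}\), checked on an arbitrary grid:

```python
import math

def psi(x):
    # psi(x) = e^x - x - 1
    return math.exp(x) - x - 1

etas = [0.1, 0.5, 1.0, 5.0, 50.0]
checks_one = [psi(-eta) / eta <= 1 for eta in etas]
checks_half = [psi(-eta) / eta <= eta / 2 + 1e-12 for eta in etas]
```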
Proposition 8. \(\delta \le \frac{1}{\eta} \sum_k w_k \psi(\eta r_k) \le \frac{\psi(\eta)}{\eta} \sum_k w_k r_k^2\).
Proof We show in (Koolen, Van Erven, and Grünwald 2014, Lemma C.4) that \[\delta ~=~ \min_\lambda \frac{1}{\eta} \sum_k w_k \psi\left(\eta(\lambda - \ell_k)\right)\] and hence the first bound follows by plugging in the sub-optimal choice \(\lambda = h\). The second uses that \(\frac{\psi(x)}{x^2}\) is increasing; since \(r_k \le 1\) and hence \(\eta r_k \le \eta\), it follows that \(\psi(\eta r_k) \le r_k^2 \psi(\eta)\). \(\Box\)
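A numerical check of the chain of inequalities in Proposition 8 (example values arbitrary):

```python
import math

def psi(x):
    # psi(x) = e^x - x - 1
    return math.exp(x) - x - 1

eta = 1.3
w = [0.5, 0.3, 0.2]
loss = [0.1, 0.9, 0.4]
h = sum(wk * lk for wk, lk in zip(w, loss))
m = -math.log(sum(wk * math.exp(-eta * lk) for wk, lk in zip(w, loss))) / eta
delta = h - m

# (1/eta) * sum_k w_k * psi(eta * r_k)  with  r_k = h - ell_k
bound1 = sum(wk * psi(eta * (h - lk)) for wk, lk in zip(w, loss)) / eta
# (psi(eta)/eta) * sum_k w_k * r_k^2
bound2 = psi(eta) / eta * sum(wk * (h - lk) ** 2 for wk, lk in zip(w, loss))
```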
Proposition 9. \(\sum_k w_k (r_k)^2 \le h(1-h)\).
Proof Recall that \(r_k = h - \ell_k\). Since the quadratic \(\ell \mapsto (h - \ell)^2\) is convex, it lies below its chord on \([0,1]\) (just as in the proof of Proposition 2), so \[\sum_k w_k (r_k)^2 ~\le~ \sum_k w_k \left( (1-\ell_k) (h - 0)^2 + \ell_k (h - 1)^2 \right) ~=~ (1-h) h\] as required. \(\Box\)
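And a final numerical check of Proposition 9 (example values arbitrary):

```python
w = [0.5, 0.3, 0.2]
loss = [0.1, 0.9, 0.4]
h = sum(wk * lk for wk, lk in zip(w, loss))
# sum_k w_k * r_k^2  with  r_k = h - ell_k
second_moment = sum(wk * (h - lk) ** 2 for wk, lk in zip(w, loss))
```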