In this post I collect a series of bounds on the so-called mixability gap. The mixability gap is an important ingredient in the analysis (regret bound) of the Hedge algorithm; see (De Rooij et al. 2014) for more details. Fix a number of dimensions \(K \ge 1\), a probability distribution \({\boldsymbol w}\in \triangle_K\), a loss vector \({\boldsymbol {\ell}}\in [0,1]^K\) and a learning rate \(\eta > 0\). The *mixability gap* with these ingredients is defined as the difference \[\delta
~=~
h - m^\eta\] between the *linear loss* \[h
~=~
\sum_k w_k {\ell}_k\] and the *\(\eta\)-mix loss* \[m^\eta
~=~
\frac{-1}{\eta} \ln \sum_k w_k e^{-\eta {\ell}_k}
.\] What bounds can we give on \(\delta\)? The bounds and their implication relations are shown in the figure below. For brevity, I use two abbreviations: \(r_k = h - {\ell}_k\) and \(\psi(x) = e^x - x - 1\).
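To make the definitions concrete, here is a minimal Python sketch (the helper name `mix_loss_gap` is my own, not from the literature) that computes \(h\), \(m^\eta\) and \(\delta\) for given weights, losses and learning rate:

```python
import math

def mix_loss_gap(w, loss, eta):
    """Return (h, m, delta) for weights w, losses in [0,1], and eta > 0."""
    # linear loss: expected loss under w
    h = sum(wk * lk for wk, lk in zip(w, loss))
    # eta-mix loss: a soft-min of the losses under w
    m = -math.log(sum(wk * math.exp(-eta * lk) for wk, lk in zip(w, loss))) / eta
    return h, m, h - m

# two experts with binary losses, uniform weights
h, m, delta = mix_loss_gap([0.5, 0.5], [0.0, 1.0], eta=1.0)
```

On this example \(h = 1/2\), and \(\delta\) already satisfies the bounds derived below (in particular Proposition 1 and the \(\eta/8\) bound).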

**Proposition 1.** *\(0 \le \delta\).*

*Proof* Applying Jensen’s inequality to the concave logarithm results in \(m^\eta \le h\) and hence \(\delta \ge 0\). \(\Box\)

**Proposition 2.** *\(\delta \le \frac{1}{\eta} \ln {\left((1-h)e^{\eta h} + h e^{\eta(h-1)}\right)}\).*

*Proof* By convexity of the exponential we have \(e^{- \eta {\ell}} \le (1-{\ell}) + {\ell}e^{-\eta}\) for any \({\ell}\in [0,1]\). So we have \[\sum_k w_k e^{-\eta {\ell}_k}
~\le~
\sum_k w_k {\left((1-{\ell}_k) + {\ell}_k e^{- \eta}\right)}
~=~
1-h + h e^{-\eta}
,\] and as a result \begin{equation}\label{eq:delta.after.cvx}
\delta
~=~
h - m^\eta
~\le~
h + \frac{1}{\eta} \ln (1 - h + h e^{-\eta})
~=~
\frac{1}{\eta} \ln {\left((1-h) e^{\eta h} + h e^{\eta(h-1)}\right)}
.
\end{equation} The above bound is tight for experts with binary losses in \(\{0,1\}\). \(\Box\)
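As a numerical sanity check (a sketch, not part of the original argument), the following verifies that the bound of Proposition 2 holds with equality when all losses lie in \(\{0,1\}\):

```python
import math

eta = 0.7
w = [0.3, 0.2, 0.5]
loss = [0.0, 1.0, 1.0]   # binary losses, so the bound should be tight

h = sum(wk * lk for wk, lk in zip(w, loss))
m = -math.log(sum(wk * math.exp(-eta * lk) for wk, lk in zip(w, loss))) / eta
delta = h - m
# right-hand side of Proposition 2
bound = math.log((1 - h) * math.exp(eta * h) + h * math.exp(eta * (h - 1))) / eta
```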

**Proposition 3.** *\(\frac{1}{\eta} \ln {\left((1-h) e^{\eta h} + h e^{\eta(h-1)}\right)}
\le
\frac{1}{e^{\eta }-1}+\frac{\ln \frac{e^{\eta }-1}{\eta}-1}{\eta}
\).*

*Proof* As can be seen from the rightmost equality of \(\eqref{eq:delta.after.cvx}\), the left-hand side is a concave function of \(h\). Its derivative is zero at \(h=1 -{\left(\frac{1}{\eta }-\frac{1}{e^{\eta }-1}\right)}\). Moreover, this optimiser satisfies \(h \in [\frac{1}{2}, 1]\). The value at this optimiser equals the right-hand side. \(\Box\)
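One can also check Proposition 3 numerically (again just a sketch): maximise the bound of Proposition 2 over a fine grid of \(h\) values and compare with the closed form and the claimed optimiser:

```python
import math

eta = 1.3
# closed-form right-hand side of Proposition 3
rhs = 1 / (math.exp(eta) - 1) + (math.log((math.exp(eta) - 1) / eta) - 1) / eta
# claimed optimiser of the left-hand side
h_star = 1 - (1 / eta - 1 / (math.exp(eta) - 1))

# brute-force maximum of the left-hand side over a grid of h in (0, 1)
grid_max = max(
    math.log((1 - h) * math.exp(eta * h) + h * math.exp(eta * (h - 1))) / eta
    for h in (i / 10000 for i in range(1, 10000))
)
```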

**Proposition 4.** *\(\frac{1}{e^{\eta }-1}+\frac{\ln \frac{e^{\eta }-1}{\eta}-1}{\eta} \le \frac{\eta}{8}\).*

*Proof* Let \(\operatorname{KL}_2(p\|h) = p \ln \frac{p}{h} + (1-p) \ln \frac{1-p}{1-h}\) be the binary relative entropy. By Pinsker’s Inequality, \(\operatorname{KL}_2(p\|h) \ge 2 (p-h)^2\). So \begin{align}
\notag
\frac{1}{e^{\eta }-1}+\frac{\ln \frac{e^{\eta }-1}{\eta}-1}{\eta}
&~=~
\notag
\max_h~ h + \frac{1}{\eta} \ln (1-h+h e^{-\eta})
\\
&~=~
\label{eq:withKL}
\max_{h,p}~ (h-p) -\frac{1}{\eta} \operatorname{KL}_2(p\|h)
\\
\notag
&~\le~
\max_{h,p}~ (h-p) - \frac{2}{\eta} (h-p)^2
\\
\notag
&~\le~
\frac{\eta}{8}\end{align} where the last inequality follows from plugging in the unconstrained maximiser \(h-p=\frac{\eta}{4}\). \(\Box\)
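A quick spot-check of Proposition 4 (the two sides approach each other as \(\eta \to 0\), so the inequality is asymptotically tight there):

```python
import math

def c(eta):
    # closed-form constant from Propositions 3-5
    return 1 / (math.exp(eta) - 1) + (math.log((math.exp(eta) - 1) / eta) - 1) / eta

# pairs (eta, left-hand side, eta/8) for a few learning rates
checks = [(eta, c(eta), eta / 8) for eta in [0.1, 0.5, 1.0, 2.0, 5.0]]
```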

Propositions 2 to 4 establish that \(\delta \le \frac{\eta}{8}\). This result is known as Hoeffding’s Inequality (Cesa-Bianchi and Lugosi 2006, Lemma A.1.1).

**Proposition 5.** *\(\frac{1}{e^{\eta }-1}+\frac{\ln \frac{e^{\eta }-1}{\eta}-1}{\eta} \le 1\).*

*Proof* It is clear from \(\eqref{eq:withKL}\) that the left-hand side is increasing in \(\eta\). Its limit as \(\eta \to \infty\) equals \(1\). \(\Box\)
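Both properties used in the proof of Proposition 5 are easy to confirm numerically (a sketch):

```python
import math

def c(eta):
    # closed-form constant from Propositions 3-5
    return 1 / (math.exp(eta) - 1) + (math.log((math.exp(eta) - 1) / eta) - 1) / eta

# evaluate along an increasing sequence of learning rates
vals = [c(eta) for eta in [0.5, 1.0, 2.0, 5.0, 10.0, 20.0]]
```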

**Proposition 6.** *\(\frac{1}{\eta} \ln {\left((1-h) e^{\eta h} + h e^{\eta(h-1)}\right)}
\le
\frac{\psi(-\eta)}{\eta} h\).*

*Proof* The objective is concave in \(h\). The upper bound is the tangent at \(h=0\). \(\Box\)

Propositions 2 and 6 establish \(\delta \le \frac{\psi(-\eta)}{\eta} h\), which we may rearrange into the perhaps familiar \[h ~\le~ \frac{\eta}{1 -e^{-\eta}} m^\eta .\]
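The combined bound and its rearranged form can be checked on an arbitrary example (a sketch; the weights and losses below are made up):

```python
import math

eta = 0.9
w = [0.2, 0.3, 0.5]
loss = [0.1, 0.6, 0.35]

h = sum(wk * lk for wk, lk in zip(w, loss))
m = -math.log(sum(wk * math.exp(-eta * lk) for wk, lk in zip(w, loss))) / eta
psi_neg = math.exp(-eta) + eta - 1   # psi(-eta) = e^{-eta} + eta - 1
```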

**Proposition 7.** *\(\frac{\psi(-\eta)}{\eta}
\le
1\) and \(\frac{\psi(-\eta)}{\eta}
\le
\frac{\eta}{2}\).*

*Proof* The function \(\frac{\psi(-\eta)}{\eta}\) is increasing in \(\eta\) with value \(1\) for \(\eta \to \infty\). It is also concave in \(\eta\), and hence bounded by its tangent at \(\eta=0\), which is \(\frac{\eta}{2}\). \(\Box\)
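Proposition 7 can likewise be spot-checked across learning rates (a sketch):

```python
import math

def slope(eta):
    # psi(-eta)/eta with psi(x) = e^x - x - 1
    return (math.exp(-eta) + eta - 1) / eta

vals = [(eta, slope(eta)) for eta in [0.1, 0.5, 1.0, 3.0, 10.0]]
```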

**Proposition 8.** *\(
\delta
\le
\frac{1}{\eta}
\sum_k w_k \psi(\eta r_k)
\le
\frac{\psi(\eta)}{\eta}
\sum_k w_k r_k^2
\).*

*Proof* We show in (Koolen, Van Erven, and Grünwald 2014, Lemma C.4) that \[\delta
~=~
\min_\lambda \frac{1}{\eta} \sum_k w_k \psi{\left(\eta(\lambda - {\ell}_k)\right)}\] and hence the first bound follows by plugging in the sub-optimal choice \(\lambda = h\). The second uses that \(\frac{\psi(x)}{x^2}\) is increasing: since \(r_k \le 1\) we have \(\eta r_k \le \eta\), and hence \(\psi(\eta r_k) \le r_k^2 \psi(\eta)\). \(\Box\)

**Proposition 9.** *\(\sum_k w_k (r_k)^2
\le
h(1-h)
\).*

*Proof* Recall that \(r_k = h - {\ell}_k\). Since the quadratic \({\ell}\mapsto (h - {\ell})^2\) is convex, it is upper bounded on \([0,1]\) by its chord, i.e. \((h-{\ell})^2 \le (1-{\ell})(h-0)^2 + {\ell}(h-1)^2\). Hence \[\sum_k w_k (r_k)^2
~\le~
\sum_k w_k {\left( (1-{\ell}_k) (h - 0)^2 + {\ell}_k (h - 1)^2 \right)}
~=~
(1-h) h\] as required. \(\Box\)
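Finally, a small check of Proposition 9 (a sketch with made-up weights and losses):

```python
w = [0.25, 0.25, 0.5]
loss = [0.0, 0.8, 0.4]

h = sum(wk * lk for wk, lk in zip(w, loss))
# weighted squared deviations sum_k w_k r_k^2
var = sum(wk * (h - lk) ** 2 for wk, lk in zip(w, loss))
```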

Cesa-Bianchi, Nicolò, and Gábor Lugosi. 2006. *Prediction, Learning, and Games*. Cambridge University Press.

Koolen, Wouter M., Tim van Erven, and Peter D. Grünwald. 2014. “Learning the Learning Rate for Prediction with Expert Advice.” In *Advances in Neural Information Processing Systems (NIPS) 27*, 2294–2302. http://papers.nips.cc/paper/5241-learning-the-learning-rate-for-prediction-with-expert-advice.pdf.

Rooij, Steven de, Tim van Erven, Peter D. Grünwald, and Wouter M. Koolen. 2014. “Follow the Leader If You Can, Hedge If You Must.” *Journal of Machine Learning Research* 15 (April): 1281–1316. http://jmlr.org/papers/volume15/rooij14a/rooij14a.pdf.