2016-01-13

In this post I collect a series of bounds on the so-called mixability gap. The mixability gap is an important ingredient of the analysis (regret bound) of the Hedge algorithm (see (De Rooij et al. 2014) for more details).

Fix a number of dimensions \(K \ge 1\), a probability distribution \(\boldsymbol w\in \triangle_K\), a loss vector \(\boldsymbol \ell\in [0,1]^K\) and a learning rate \(\eta \ge 0\). The *mixability gap* with these ingredients is defined as the difference \[\delta
~=~
h - m^\eta\] between the *linear loss* \[h
~=~
\sum_k w_k \ell_k\] and the *\(\eta\)-mix loss* \[m^\eta
~=~
\frac{-1}{\eta} \ln \sum_k w_k e^{-\eta \ell_k}
.\] What bounds can we give on \(\delta\)? The bounds and their implication relation are shown in the figure below. For clarity, I used two abbreviations: \(r_k = h - \ell_k\) and \(\psi(x) = e^x - x -1\).

**Proposition 1.** *\(0 \le \delta\).*

*Proof* Applying Jensen’s inequality to the concave logarithm results in \(m^\eta \le h\) and hence \(\delta \ge 0\). \(\Box\)

**Proposition 2.** *\(\delta \le \frac{1}{\eta} \ln \left((1-h)e^{\eta h} + h e^{\eta(h-1)}\right)\).*

*Proof* By convexity of the exponential we have \(e^{- \eta \ell} \le (1-\ell) + \ell e^{-\eta}\) for any \(\ell\in [0,1]\). So we have \[\sum_k w_k e^{-\eta \ell_k}
~\le~
\sum_k w_k \left((1-\ell_k) + \ell_k e^{- \eta}\right)
~=~
1-h + h e^{-\eta}
,\] and as a result \begin{equation}\label{eq:delta.after.cvx}
\delta
~=~
h - m^\eta
~\le~
h + \frac{1}{\eta} \ln (1 - h + h e^{-\eta})
~=~
\frac{1}{\eta} \ln \left((1-h) e^{\eta h} + h e^{\eta(h-1)}\right)
.
\end{equation} The above bound is tight for experts with binary losses in \(\{0,1\}\). \(\Box\)

**Proposition 3.** *\(\frac{1}{\eta} \ln \left((1-h) e^{\eta h} + h e^{\eta(h-1)}\right)
\le
\frac{1}{e^{\eta }-1}+\frac{\ln \frac{e^{\eta }-1}{\eta}-1}{\eta}\).*

*Proof* As can be seen from the rightmost equality of \(\eqref{eq:delta.after.cvx}\), the left-hand side is a concave function of \(h\). Its derivative is zero at \(h=1 -\left(\frac{1}{\eta }-\frac{1}{e^{\eta }-1}\right)\). Moreover, this optimiser satisfies \(h \in [\frac{1}{2}, 1]\). The value at this optimiser equals the right-hand side. \(\Box\)

**Proposition 4.** *\(\frac{1}{e^{\eta }-1}+\frac{\ln \frac{e^{\eta }-1}{\eta}-1}{\eta} \le \frac{\eta}{8}\).*

*Proof* Let \(\operatorname{KL}_2(p\|h) = p \ln \frac{p}{h} + (1-p) \ln \frac{1-p}{1-h}\) be the binary relative entropy. By Pinsker’s Inquality \(\operatorname{KL}_2(p\|h) \ge 2 (p-h)^2\). So \begin{align}
\notag
\frac{1}{e^{\eta }-1}+\frac{\ln \frac{e^{\eta }-1}{\eta}-1}{\eta}
&~=~
\notag
\max_h~ h + \frac{1}{\eta} \ln (1-h+h e^{-\eta})
\\
&~=~
\label{eq:withKL}
\max_{h,p}~ (h-p) -\frac{1}{\eta} \operatorname{KL}_2(p\|h)
\\
\notag
&~\le~
\max_{h,p}~ (h-p) - \frac{2}{\eta} (h-p)^2
\\
\notag
&~\le~
\frac{\eta}{8}\end{align} where the last inequality follows from plugging in the unconstrained maximiser \(|h-p|=\frac{\eta}{4}\). \(\Box\)

Propositions 2 to 4 establish that \(\delta \le \frac{\eta}{8}\). This result is called Hoeffding’s Inequality in (Cesa-Bianchi and Lugosi 2006 Lemma A.1.1).

**Proposition 5.** *\(\frac{1}{e^{\eta }-1}+\frac{\ln \frac{e^{\eta }-1}{\eta}-1}{\eta} \le 1\).*

*Proof* It is clear from \(\eqref{eq:withKL}\) that the left-hand side is increasing in \(\eta\). Taking the limit of \(\eta \to \infty\) results in \(1\). \(\Box\)

**Proposition 6.** *\(\frac{1}{\eta} \ln \left((1-h) e^{\eta h} + h e^{\eta(h-1)}\right)
\le
\frac{\psi(-\eta)}{\eta} h\).*

*Proof* The objective is concave in \(h\). The upper bound is the tangent at \(h=0\). \(\Box\)

Propositions 2 and 6 establish \(\delta \le \frac{\psi(-\eta)}{\eta} h\), which we may reorganise to to the perhaps familiar \[h ~\le~ \frac{\eta}{1 -e^{-\eta}} m^\eta .\]

**Proposition 7.** *\(\frac{\psi(-\eta)}{\eta}
\le
1\) and \(\frac{\psi(-\eta)}{\eta}
\le
\frac{\eta}{2}\).*

*Proof* The function \(\frac{\psi(-\eta)}{\eta}\) is increasing in \(\eta\) with value \(1\) for \(\eta \to \infty\). It is also concave in \(\eta\), and hence bounded by its tangent at \(\eta=0\), which is \(\frac{\eta}{2}\). \(\Box\)

**Proposition 8.** *\(\delta
\le
\frac{1}{\eta}
\sum_k w_k \psi(\eta r_k)
\le
\frac{\psi(\eta)}{\eta}
\sum_k w_k r_k^2\).*

*Proof* We show in (Koolen, Van Erven, and Grünwald 2014 Lemma C.4) that \[\delta
~=~
\min_\lambda \frac{1}{\eta} \sum_k w_k \psi\left(\eta(\lambda - \ell_t^k)\right)\] and hence the first bound follows by plugging in the sub-optimal choice \(\lambda = h\). The second uses that \(\frac{\psi(x)}{x^2}\) is increasing, from which it follows that \(\psi(\eta r_k) \le r_k^2 \psi(\eta)\). \(\Box\)

**Proposition 9.** *\(\sum_k w_k (r_k)^2
\le
h(1-h)\).*

*Proof* Recall that \(r_k = h - \ell_k\). Applying Jensen to the convex quadratic results in \[\sum_k w_k (r_k)^2
~\le~
\sum_k w_k \left( (1-\ell_k) (h - 0)^2 + \ell_k (h - 1)^2 \right)
~=~
(1-h) h\] as required. \(\Box\)

Cesa-Bianchi, Nicolò, and Gábor Lugosi. 2006. *Prediction, Learning, and Games*. Cambridge University Press.

Koolen, Wouter M., Tim van Erven, and Peter D. Grünwald. 2014. “Learning the Learning Rate for Prediction with Expert Advice.” In *Advances in Neural Information Processing Systems (Nips) 27*, 2294–2302. http://papers.nips.cc/paper/5241-learning-the-learning-rate-for-prediction-with-expert-advice.pdf.

Rooij, Steven de, Tim van Erven, Peter D. Grünwald, and Wouter M. Koolen. 2014. “Follow the Leader If You Can, Hedge If You Must.” *Journal of Machine Learning Research* 15 (April): 1281–1316. http://jmlr.org/papers/volume15/rooij14a/rooij14a.pdf.