# Introduction

I want to talk about Squint, a new prediction strategy Tim van Erven and I recently discovered (see Koolen and Van Erven (2015)). Squint operates in the following protocol, called the Hedge setting. Prediction proceeds in rounds. In round $$t$$ the learner plays a probability vector $${\boldsymbol{w}}_t$$ on experts $$k \in \{1, \ldots, K\}$$. Then the adversary reveals the vector $${\boldsymbol{{\ell}}}_t \in [0,1]^K$$ of expert losses and the learner incurs the dot loss $${\boldsymbol{w}}_t^{\intercal}{\boldsymbol{{\ell}}}_t$$.

#### Squint

Let us write $$r_t^k := ({\boldsymbol{w}}_t - {\boldsymbol{e}}_k)^{\intercal}{\boldsymbol{{\ell}}}_t$$ for the instantaneous regret w.r.t. expert $$k$$ in round $$t$$. Also let us abbreviate the cumulative regret to $$R_T^k := \sum_{t=1}^T r_t^k$$ and the cumulative squared regret to $$V_T^k := \sum_{t=1}^T (r_t^k)^2$$. The Squint strategy is motivated by the potential function $\Phi_T ~:=~ \operatorname*{\mathbb E}_{\pi(k)\gamma(\eta)} {\bigg[ e^{\eta R_T^k - \eta^2 V_T^k} -1 \bigg]} ,$ where $$\pi$$ is a prior on experts $$k \in \{1, \ldots, K\}$$ and $$\gamma$$ is a prior on learning rates $$\eta \in [0,1/2]$$. The Squint weights ${\boldsymbol{w}}_t ~=~ \frac{ \operatorname*{\mathbb E}_{\pi(k)\gamma(\eta)} {\left[ \eta {\boldsymbol{e}}_k e^{\eta R_{t-1}^k - \eta^2 V_{t-1}^k} \right]} }{ \operatorname*{\mathbb E}_{\pi(k)\gamma(\eta)} {\left[ \eta e^{\eta R_{t-1}^k - \eta^2 V_{t-1}^k} \right]} }$ are chosen to make the potential non-increasing: $$0 = \Phi_0 \ge \Phi_1 \ge \ldots$$ To see this, we write \begin{align*} \Phi_{t} - \Phi_{t-1} &~=~ \operatorname*{\mathbb E}_{\pi(k)\gamma(\eta)} {\left[ e^{\eta R_{t-1}^k - \eta^2 V_{t-1}^k} (e^{\eta r_t^k - (\eta r_t^k)^2}-1) \right]} \\ &~\le~ \operatorname*{\mathbb E}_{\pi(k)\gamma(\eta)} {\left[ e^{\eta R_{t-1}^k - \eta^2 V_{t-1}^k} \eta r_t^k \right]} \\ &~=~ \operatorname*{\mathbb E}_{\pi(k)\gamma(\eta)} {\left[ e^{\eta R_{t-1}^k - \eta^2 V_{t-1}^k} \eta ({\boldsymbol{w}}_t - {\boldsymbol{e}}_k) \right]}^{\intercal}{\boldsymbol{{\ell}}}_t \\ &~=~ 0 \end{align*} where the inequality uses that $$e^{x-x^2} \le 1 + x$$ for $$x \ge -1/2$$ and the last equality is due to the choice of weights $${\boldsymbol{w}}_t$$.
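In code, the Squint weights can be approximated by discretizing the prior $$\gamma$$ over learning rates; the discretized $$\gamma$$ is itself a proper prior, so the potential argument above applies verbatim. A minimal sketch (the uniform grid, uniform priors and random losses are illustrative choices, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
K, T = 5, 200

etas = np.linspace(1e-3, 0.5, 100)           # grid approximating gamma on (0, 1/2]
gamma = np.full(etas.size, 1.0 / etas.size)  # uniform prior over the grid
pi = np.full(K, 1.0 / K)                     # uniform prior over experts

R = np.zeros(K)  # cumulative regrets R_{t-1}^k
V = np.zeros(K)  # cumulative squared regrets V_{t-1}^k

for t in range(T):
    # evidence e^{eta R_{t-1}^k - eta^2 V_{t-1}^k}, shape (len(etas), K)
    evid = np.exp(np.outer(etas, R) - np.outer(etas**2, V))
    # unnormalised Squint weights: pi(k) E_gamma[eta * evidence]
    num = pi * (gamma[:, None] * etas[:, None] * evid).sum(axis=0)
    w = num / num.sum()

    loss = rng.uniform(0.0, 1.0, K)  # adversary's losses (here: random)
    r = w @ loss - loss              # instantaneous regrets r_t^k
    R += r
    V += r**2

# the potential stays non-positive throughout
evid = np.exp(np.outer(etas, R) - np.outer(etas**2, V))
Phi = (pi * (gamma[:, None] * (evid - 1.0)).sum(axis=0)).sum()
```

The final assertion of the argument, $$\Phi_T \le 0$$, can be checked directly on the computed `Phi`.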

#### Bounds

In Koolen and Van Erven (2015) we show that a negative potential implies quantile regret bounds: for any subset $${\mathcal K}$$ of experts, with abbreviations $$R_T^{\mathcal K}= \operatorname*{\mathbb E}_{\pi(k|{\mathcal K})}[R_T^k]$$ and $$V_T^{\mathcal K}= \operatorname*{\mathbb E}_{\pi(k|{\mathcal K})}[V_T^k]$$, we have $$\label{eq:orig} R_T^{\mathcal K}~\preceq~ \sqrt{V_T^{\mathcal K}{\left(- \ln \pi({\mathcal K}) + \ln \ln T\right)}} ,$$ where $$\preceq$$ hides constant and lower-order factors.

Now what happens if we do not want to compare to distributions of the form $$\pi(k|{\mathcal K})$$ but to a general distribution $$\rho(k)$$? Intuitively, using abbreviations $$R_T^\rho = \operatorname*{\mathbb E}_{\rho(k)} [R_T^k]$$ and $$V_T^\rho = \operatorname*{\mathbb E}_{\rho(k)} [ V_T^k]$$, we should be able to prove a relative entropy regret bound $$\label{eq:newbd} R_T^\rho ~\preceq~ \sqrt{V_T^\rho {\left({\operatorname{KL}}{\left(\rho\middle\|\pi\right)} + \ln \ln T\right)}} .$$ Now $$\eqref{eq:newbd}$$ is a proper generalisation of $$\eqref{eq:orig}$$, as applying it to $$\rho(k) = \pi(k|{\mathcal K})$$ gives ${\operatorname{KL}}{\left(\rho\middle\|\pi\right)} ~=~ \operatorname*{\mathbb E}_{\pi(k|{\mathcal K})} {\left[\ln \frac{\pi(k|{\mathcal K})}{\pi(k)}\right]} ~=~ - \ln \pi({\mathcal K}) .$ When we wrote the paper, we restricted to bounds of the form $$\eqref{eq:orig}$$ for simplicity. In related work, Luo and Schapire (2015) give relative entropy bounds for a similar algorithm. Since then, several people have asked us whether bounds of the form $$\eqref{eq:newbd}$$ actually hold for Squint. They do hold, for unmodified Squint, without any additional effort. In this post I will show how to derive them.
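The displayed identity $${\operatorname{KL}}(\pi(\cdot|{\mathcal K})\|\pi) = -\ln \pi({\mathcal K})$$ is easy to verify numerically (the prior and subset below are arbitrary choices):

```python
import numpy as np

pi = np.array([0.1, 0.2, 0.3, 0.4])    # an arbitrary prior on 4 experts
Ksub = [1, 3]                          # an arbitrary subset of experts (0-indexed)

rho = np.zeros_like(pi)
rho[Ksub] = pi[Ksub] / pi[Ksub].sum()  # rho = pi conditioned on the subset

kl = np.sum(rho[Ksub] * np.log(rho[Ksub] / pi[Ksub]))  # KL(rho || pi)
assert np.isclose(kl, -np.log(pi[Ksub].sum()))         # equals -ln pi(K)
```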

Relative entropy bounds are typically based on the following variational inequality. For distributions $$\pi(k)$$ and $$\rho(k)$$ and random variable $$X_k$$, $$\label{eq:PAC} \operatorname*{\mathbb E}_{\pi(k)} {\left[e^{X_k}\right]} ~\ge~ \operatorname*{\mathbb E}_{\rho(k)} {\left[ \frac{\pi(k)}{\rho(k)} e^{X_k}\right]} ~=~ \operatorname*{\mathbb E}_{\rho(k)} {\left[ e^{\ln \frac{\pi(k)}{\rho(k)} + X_k}\right]} ~\ge~ e^{\operatorname*{\mathbb E}_{\rho(k)} {\left[ \ln \frac{\pi(k)}{\rho(k)} + X_k\right]}} ~=~ e^{-{\operatorname{KL}}{\left(\rho\middle\|\pi\right)} + \operatorname*{\mathbb E}_{\rho(k)} {\left[X_k\right]}} .$$ The first inequality is strict if $$\rho(k)$$ has smaller support than $$\pi(k)$$. The second is Jensen’s Inequality.
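This chain of (in)equalities can be sanity-checked numerically; the priors and $$X_k$$ below are random placeholders (with full support, so the first inequality is an equality):

```python
import numpy as np

rng = np.random.default_rng(1)
K = 6
pi = rng.dirichlet(np.ones(K))    # prior pi(k)
rho = rng.dirichlet(np.ones(K))   # comparator rho(k), full support
X = rng.normal(0.0, 1.0, K)       # arbitrary random variable X_k

lhs = np.sum(pi * np.exp(X))                   # E_pi[e^X]
middle = np.sum(rho * (pi / rho) * np.exp(X))  # change of measure: equal to lhs here
kl = np.sum(rho * np.log(rho / pi))            # KL(rho || pi)
rhs = np.exp(-kl + np.sum(rho * X))            # e^{-KL(rho||pi) + E_rho[X]}

assert np.isclose(lhs, middle)  # equality: rho and pi share support
assert lhs >= rhs               # the variational inequality
```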

# Relative Entropy Bound for Proper Priors

We first do the analysis for proper priors, as it is easier. Let $$\gamma$$ be a proper prior. Then $$\Phi_T \le 0$$ implies $\operatorname*{\mathbb E}_{\pi(k)\gamma(\eta)} {\bigg[ e^{\eta R_T^k - \eta^2 V_T^k} \bigg]} ~\le~ 1 .$ Now take any comparator distribution $$\rho(k)$$ on experts. Then by $$\eqref{eq:PAC}$$, $\operatorname*{\mathbb E}_{\pi(k)\gamma(\eta)} {\bigg[ e^{\eta R_T^k - \eta^2 V_T^k} \bigg]} ~\ge~ e^{ - {\operatorname{KL}}{\left(\rho\middle\|\pi\right)} } \operatorname*{\mathbb E}_{\gamma(\eta)} {\left[ e^{ \eta R_T^\rho - \eta^2 V_T^\rho } \right]} ,$ where we abbreviated $$R_T^\rho = \operatorname*{\mathbb E}_{\rho(k)} [R_T^k]$$ and $$V_T^\rho = \operatorname*{\mathbb E}_{\rho(k)} [V_T^k]$$. We therefore find $e^{ - {\operatorname{KL}}{\left(\rho\middle\|\pi\right)} } \operatorname*{\mathbb E}_{\gamma(\eta)} {\left[ e^{ \eta R_T^\rho - \eta^2 V_T^\rho } \right]} ~\le~ 1 ,$ from which it follows that for any $$\eta$$ with mass $$\gamma(\eta)$$ we have $\eta R_T^\rho - \eta^2 V_T^\rho ~\le~ {\operatorname{KL}}{\left(\rho\middle\|\pi\right)} - \ln \gamma(\eta) .$ Plugging in the maximiser $$\hat \eta = \frac{R_T^\rho}{2 V_T^\rho}$$, we obtain $R_T^\rho ~\le~ 2 \sqrt{ V_T^\rho {\left( {\operatorname{KL}}{\left(\rho\middle\|\pi\right)} - \ln \gamma(\hat\eta) \right)} } .$ In the case that $$\gamma$$ puts insufficient mass on $$\hat \eta$$, e.g. because it is a density, we need to bound the amount of mass close enough to $$\hat \eta$$. As this task is orthogonal to the question of relative entropy bounds, I refer back to our paper Koolen and Van Erven (2015), where we show how to do this for two appropriate families of proper priors.
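As a sanity check on the tuning step: the concave function $$\eta \mapsto \eta R - \eta^2 V$$ attains its maximum $$\frac{R^2}{4V}$$ at $$\hat\eta = \frac{R}{2V}$$, and inverting $$\frac{R^2}{4V} \le B$$ gives $$R \le 2\sqrt{VB}$$. A quick grid check (the values of $$R$$ and $$V$$ are arbitrary, chosen so that $$\hat\eta \le 1/2$$):

```python
import numpy as np

R, V = 7.0, 50.0  # arbitrary regret statistics with R/(2V) <= 1/2
etas = np.linspace(0.0, 0.5, 100001)
vals = etas * R - etas**2 * V

eta_hat = R / (2 * V)
assert np.isclose(vals.max(), R**2 / (4 * V))     # maximum value is R^2/(4V)
assert abs(etas[vals.argmax()] - eta_hat) < 1e-4  # attained near eta_hat
```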

# Relative Entropy Bound for the Improper Prior

Let $$\gamma$$ be the improper prior with density $$\gamma({\textrm d}\eta) = \frac{1}{\eta} {\textrm d}\eta$$. The motivation for this prior is that it gives good regret bounds while at the same time admitting closed-form integration of the weights, resulting in constant run time per expert per round. We start out with negative potential: $0 ~\ge~ \Phi_T ~=~ \operatorname*{\mathbb E}_{\pi(k)} {\bigg[ \int_0^{1/2} \frac{ e^{\eta R_T^k - \eta^2 V_T^k} - 1 }{ \eta } {\textrm d}\eta \bigg]} .$ For any $$\epsilon \in [0,1/2]$$, we can split the integral $\Phi_T ~=~ \operatorname*{\mathbb E}_{\pi(k)} {\bigg[ \int_0^{\epsilon} \frac{ e^{\eta R_T^k - \eta^2 V_T^k} - 1 }{ \eta } {\textrm d}\eta \bigg]} + \operatorname*{\mathbb E}_{\pi(k)} {\bigg[ \int_\epsilon^{1/2} \frac{ e^{\eta R_T^k - \eta^2 V_T^k} - 1 }{ \eta } {\textrm d}\eta \bigg]}$ To bound the first term we use $$e^x \ge 1+x$$, $$R_T^k \ge -T$$ and $$V_T^k \le T$$ to find $\operatorname*{\mathbb E}_{\pi(k)} {\bigg[ \int_0^{\epsilon} \frac{ e^{\eta R_T^k - \eta^2 V_T^k} - 1 }{ \eta } {\textrm d}\eta \bigg]} ~\ge~ \int_0^{\epsilon} - T - \eta T \,{\textrm d}\eta ~=~ - {\left( \epsilon + \frac{\epsilon^2}{2}\right)} T .$ To bound the second term, we use $$\eqref{eq:PAC}$$ to obtain \begin{align*} \operatorname*{\mathbb E}_{\pi(k)} {\bigg[ \int_\epsilon^{1/2} \frac{ e^{\eta R_T^k - \eta^2 V_T^k} - 1 }{ \eta } {\textrm d}\eta \bigg]} &~=~ \operatorname*{\mathbb E}_{\pi(k)} {\bigg[ \int_\epsilon^{1/2} \frac{ e^{\eta R_T^k - \eta^2 V_T^k} }{ \eta } {\textrm d}\eta \bigg]} - \operatorname*{\mathbb E}_{\pi(k)} {\bigg[ \int_\epsilon^{1/2} \frac{ 1 }{ \eta } {\textrm d}\eta \bigg]} \\ &~\ge~ e^{- {\operatorname{KL}}{(\rho\|\pi)}} \int_\epsilon^{1/2} \frac{ e^{\eta R_T^\rho - \eta^2 V_T^\rho} }{ \eta } {\textrm d}\eta + \ln (2 \epsilon) . 
\end{align*} Plugging in the convenient choice $$\epsilon = \frac{1}{2(T+1)}$$ for which $$(\epsilon + \epsilon^2/2)T \le 1/2$$, we now arrive at $e^{- {\operatorname{KL}}{(\rho\|\pi)}} \int_{\frac{1}{2(T+1)}}^{1/2} \frac{ e^{\eta R_T^\rho - \eta^2 V_T^\rho} }{ \eta } {\textrm d}\eta ~\le~ \frac{1}{2} + \ln (T+1) .$ To bound the integral, we need a case analysis, based on $$R = R_T^\rho$$ and $$V = V_T^\rho$$. There are three cases.

1. First the important case. If $$\hat \eta = \frac{R}{2 V}$$ satisfies $$\hat \eta - \frac{1}{\sqrt{2 V}} \ge \frac{1}{2(T+1)}$$ and $$\hat \eta \le 1/2$$, we find \begin{align*} \int_{\frac{1}{2(T+1)}}^{1/2} \frac{ e^{\eta R - \eta^2 V} }{ \eta } {\textrm d}\eta &~\ge~ \int_{\hat \eta - \frac{1}{\sqrt{2 V}}}^{\hat \eta} \frac{ e^{(\hat \eta - \frac{1}{\sqrt{2 V}}) R - (\hat \eta - \frac{1}{\sqrt{2 V}})^2 V} }{ \eta } {\textrm d}\eta ~=~ e^{\frac{R^2}{4 V} - \frac{1}{2}} \ln \frac{1}{1 - \frac{\sqrt{2 V}}{R}} \\ &~\ge~ e^{\frac{1}{2} {\left(\frac{R}{\sqrt{2 V}} - 1\right)}^2} e^{\frac{R}{\sqrt{2 V}}-1} \frac{\sqrt{2 V}}{R} ~\ge~ e^{\frac{1}{2} {\left(\frac{R}{\sqrt{2 V}} - 1\right)}^2} \end{align*} In this case, $e^{- {\operatorname{KL}}{\left(\rho\middle\|\pi\right)} + \frac{1}{2} {\left(\frac{R}{\sqrt{2 V}} - 1\right)}^2} ~\le~ \frac{1}{2} + \ln (T+1)$ from which it follows that $R ~\le~ \sqrt{2 V} {\left( 1 + \sqrt{2 {\left( {\operatorname{KL}}{\left(\rho\middle\|\pi\right)} + \ln {\left( \frac{1}{2} + \ln (T+1) \right)} \right)} } \right)} .$

2. Second, if $$\hat \eta - \frac{1}{\sqrt{2 V}} < \frac{1}{2(T+1)}$$ we conclude $R ~<~ \frac{V}{T+1} + \sqrt{2 V} ~<~ 1 + \sqrt{2 V} ,$ where the last inequality uses $$V \le T$$.

3. Finally, if $$\hat \eta > 1/2$$ we find $$R > V$$, and hence for any $$u \in [\frac{1}{2(T+1)}, 1/2]$$, \begin{align*} \int_{\frac{1}{2(T+1)}}^{1/2} \frac{ e^{\eta R - \eta^2 V} }{ \eta } {\textrm d}\eta &~\ge~ \int_{u}^{1/2} \frac{ e^{u R - u^2 V} }{ \eta } {\textrm d}\eta ~=~ e^{u R - u^2 V} \ln \frac{1}{2 u} ~\ge~ e^{u (1-u)R} \ln \frac{1}{2 u} \end{align*} so that $R ~\le~ \frac{1}{u (1-u)} {\left( {\operatorname{KL}}{(\rho\|\pi)} + \ln \frac{ \frac{1}{2} + \ln (T+1) }{ - \ln (2 u) } \right)} .$ Plugging in $$u = \frac{5 - \sqrt{5}}{10}$$ (which is sufficiently large for all $$T \ge 1$$) we get $R ~\le~ 5 {\big( {\operatorname{KL}}{(\rho\|\pi)} + \ln {\left( 1 + 2\ln (T+1) \right)} \big)}$

Putting all three cases together, we find $\begin{gathered} R_T^\rho ~\le~ \sqrt{2 V_T^\rho} {\left( 1 + \sqrt{2 {\left( {\operatorname{KL}}{\left(\rho\middle\|\pi\right)} + \ln {\left( \frac{1}{2} + \ln (T+1) \right)} \right)} } \right)} \\ + 1 + 5 {\big( {\operatorname{KL}}{(\rho\|\pi)} + \ln {\left( 1 + 2\ln (T+1) \right)} \big)},\end{gathered}$ which is the relative entropy upgrade of Koolen and Van Erven (2015, Theorem 4).
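As an aside on the closed-form integration that motivates the improper prior: with $$\gamma({\textrm d}\eta) = \frac{1}{\eta}{\textrm d}\eta$$, the factor $$\eta$$ in the weight numerator cancels against the density, so the unnormalised weight of expert $$k$$ is $$\pi(k)\int_0^{1/2} e^{\eta R_{t-1}^k - \eta^2 V_{t-1}^k} \,{\textrm d}\eta$$, a Gaussian integral expressible with the error function. A sketch for $$V > 0$$ (checked against simple midpoint quadrature; the test values are arbitrary):

```python
import math

def squint_integral(R, V):
    """Closed form of the integral of exp(eta*R - eta^2*V) over [0, 1/2], V > 0."""
    mu = R / (2.0 * V)  # peak of the Gaussian integrand
    s = math.sqrt(V)
    return (math.exp(R * R / (4.0 * V)) * math.sqrt(math.pi) / (2.0 * s)
            * (math.erf(s * (0.5 - mu)) + math.erf(s * mu)))

def quadrature(R, V, n=20000):
    """Midpoint-rule approximation of the same integral, for verification."""
    h = 0.5 / n
    return h * sum(math.exp(e * R - e * e * V)
                   for e in (h * (i + 0.5) for i in range(n)))

R, V = 3.0, 10.0  # arbitrary regret statistics
assert abs(squint_integral(R, V) - quadrature(R, V)) < 1e-6
```

This is what yields the constant run time per expert per round mentioned above.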

Koolen, Wouter M., and Tim van Erven. 2015. “Second-Order Quantile Methods for Experts and Combinatorial Games.” In Proceedings of the 28th Annual Conference on Learning Theory (COLT), edited by Peter Grünwald, Elad Hazan, and Satyen Kale, 1155–75. http://jmlr.org/proceedings/papers/v40/Koolen15a.pdf.

Luo, Haipeng, and Robert E. Schapire. 2015. “Achieving All with No Parameters: Adaptive NormalHedge.” In Proceedings of the 28th Annual Conference on Learning Theory (COLT), edited by Peter Grünwald, Elad Hazan, and Satyen Kale, 1286–1304. http://jmlr.org/proceedings/papers/v40/Luo15.pdf.