# The Relative Entropy Bound for Squint

2015-08-13

Abstract

We derive the relative entropy version of the second-order quantile regret bound for Squint.

# Introduction

I want to talk about Squint, a new prediction strategy Tim van Erven and I recently discovered (see (Koolen and Van Erven 2015)). Squint operates in the following protocol, called the Hedge setting. Prediction proceeds in rounds. In round $$t$$ the learner plays a probability vector $$\boldsymbol{w}_t$$ on experts $$k \in \{1, \ldots, K\}$$. Then the adversary reveals the vector $$\boldsymbol{\ell}_t \in [0,1]^K$$ of expert losses and the learner incurs the dot loss $$\boldsymbol{w}_t^\intercal\boldsymbol{\ell}_t$$.

#### Squint

Let us write $$r_t^k := (\boldsymbol{w}_t - \boldsymbol{e}_k)^\intercal\boldsymbol{\ell}_t$$ for the instantaneous regret w.r.t. expert $$k$$ in round $$t$$. Also let us abbreviate the cumulative regret to $$R_T^k := \sum_{t=1}^T r_t^k$$ and the cumulative regret squared to $$V_T^k := \sum_{t=1}^T (r_t^k)^2$$. The Squint strategy is based on (motivated by) the potential function $\Phi_T ~:=~ \mathop{\mathrm{\mathbb E}}_{\pi(k)\gamma(\eta)} \bigg[ e^{\eta R_T^k - \eta^2 V_T^k} -1 \bigg] ,$ where $$\pi$$ is a prior on experts $$k \in \{1, \ldots, K\}$$ and $$\gamma$$ is a prior on learning rates $$\eta \in [0,1/2]$$. The Squint weights $\boldsymbol{w}_t ~=~ \frac{ \mathop{\mathrm{\mathbb E}}_{\pi(k)\gamma(\eta)} \left[ \eta \boldsymbol{e}_k e^{\eta R_{t-1}^k - \eta^2 V_{t-1}^k} \right] }{ \mathop{\mathrm{\mathbb E}}_{\pi(k)\gamma(\eta)} \left[ \eta e^{\eta R_{t-1}^k - \eta^2 V_{t-1}^k} \right] }$ are chosen to make the potential decreasing $$0 = \Phi_0 \ge \Phi_1 \ge \ldots$$ To see this, we write \begin{align*} \Phi_{t} - \Phi_{t-1} &~=~ \mathop{\mathrm{\mathbb E}}_{\pi(k)\gamma(\eta)} \left[ e^{\eta R_{t-1}^k - \eta^2 V_{t-1}^k} (e^{\eta r_t^k - (\eta r_t^k)^2}-1) \right] \\ &~\le~ \mathop{\mathrm{\mathbb E}}_{\pi(k)\gamma(\eta)} \left[ e^{\eta R_{t-1}^k - \eta^2 V_{t-1}^k} \eta r_t^k \right] \\ &~=~ \mathop{\mathrm{\mathbb E}}_{\pi(k)\gamma(\eta)} \left[ e^{\eta R_{t-1}^k - \eta^2 V_{t-1}^k} \eta (\boldsymbol{w}_t - \boldsymbol{e}_k) \right]^\intercal\boldsymbol{\ell}_t \\ &~=~ 0 \end{align*} where the inequality uses that $$e^{x-x^2} \le 1 + x$$ for $$x \ge -1/2$$ and the last equality is due to the choice of weights $$\boldsymbol{w}_t$$.

#### Bounds

In the paper (Koolen and Van Erven 2015) we show that a negative potential implies quantile regret bounds. We show that for any subset $$\mathcal K$$ of experts, with abbreviations $$R_T^\mathcal K= \mathop{\mathrm{\mathbb E}}_{\pi(k|\mathcal K)}[R_T^k]$$ and $$V_T^\mathcal K= \mathop{\mathrm{\mathbb E}}_{\pi(k|\mathcal K)}[V_T^k]$$, we have $$\label{eq:orig} R_T^\mathcal K~\preceq~ \sqrt{V_T^\mathcal K\left(- \ln \pi(\mathcal K) + \ln \ln T\right)} .$$

Now what happens if we do not want to compare to distributions of the form $$\pi(k|\mathcal K)$$ but to a general distribution $$\rho(k)$$? Intuitively, using abbreviations $$R_T^\rho = \mathop{\mathrm{\mathbb E}}_{\rho(k)} [R_T^k]$$ and $$V_T^\rho = \mathop{\mathrm{\mathbb E}}_{\rho(k)} [ V_T^k]$$, we should be able to prove a relative entropy regret bound $$\label{eq:newbd} R_T^\rho ~\preceq~ \sqrt{V_T^\rho \left(\operatorname{KL}\left(\rho\middle\|\pi\right) + \ln \ln T\right)} .$$ Now $$\eqref{eq:newbd}$$ is a proper generalisation of $$\eqref{eq:orig}$$, as applying it to $$\rho(k) = \pi(k|\mathcal K)$$ gives $\operatorname{KL}\left(\rho\middle\|\pi\right) ~=~ \mathop{\mathrm{\mathbb E}}_{\pi(k|\mathcal K)} \left[\ln \frac{\pi(k|\mathcal K)}{\pi(k)}\right] ~=~ - \ln \pi(\mathcal K)$ When we wrote the paper, we restricted to bounds of the form $$\eqref{eq:orig}$$ for simplicity. In related work, (Luo and Schapire 2015) give relative entropy bounds for a similar algorithm. Now several people have asked us whether bounds $$\eqref{eq:newbd}$$ actually hold for Squint. They do hold, for unmodified Squint, without any additional effort. In this post I will show how to derive them.

Relative entropy bounds are typically based on the following variational inequality. For distributions $$\pi(k)$$ and $$\rho(k)$$ and random variable $$X_k$$, $$\label{eq:PAC} \mathop{\mathrm{\mathbb E}}_{\pi(k)} \left[e^{X_k}\right] ~\ge~ \mathop{\mathrm{\mathbb E}}_{\rho(k)} \left[ \frac{\pi(k)}{\rho(k)} e^{X_k}\right] ~=~ \mathop{\mathrm{\mathbb E}}_{\rho(k)} \left[ e^{\ln \frac{\pi(k)}{\rho(k)} + X_k}\right] %\\ ~\ge~ e^{\mathop{\mathrm{\mathbb E}}_{\rho(k)} \left[ \ln \frac{\pi(k)}{\rho(k)} + X_k\right]} ~=~ e^{-\operatorname{KL}\left(\rho\middle\|\pi\right) + \mathop{\mathrm{\mathbb E}}_{\rho(k)} \left[X_k\right]} .$$ The first inequality is strict if $$\rho(k)$$ has smaller support than $$\pi(k)$$. The second is Jensen’s Inequality.

# Relative Entropy Bound for Proper Priors

We first do the analysis for proper priors, as it is easier. Let $$\gamma$$ be a proper prior. Then $$\Phi_T \le 0$$ implies $\mathop{\mathrm{\mathbb E}}_{\pi(k)\gamma(\eta)} \bigg[ e^{\eta R_T^k - \eta^2 V_T^k} \bigg] ~\le~ 1 .$ Now take $$\rho(k)$$ any comparator distribution on experts. Then by $$\eqref{eq:PAC}$$, $\mathop{\mathrm{\mathbb E}}_{\pi(k)\gamma(\eta)} \bigg[ e^{\eta R_T^k - \eta^2 V_T^k} \bigg] ~\ge~ e^{ - \operatorname{KL}\left(\rho\middle\|\pi\right) } \mathop{\mathrm{\mathbb E}}_{\gamma(\eta)} \left[ e^{ \eta R_T^\rho - \eta^2 V_T^\rho } \right]$ where we abbreviated $$R_T^\rho = \mathop{\mathrm{\mathbb E}}_{\rho(k)} [R_T^k]$$ and $$V_T^\rho = \mathop{\mathrm{\mathbb E}}_{\rho(k)} [V_T^k]$$. We therefore find $e^{ - \operatorname{KL}\left(\rho\middle\|\pi\right) } \mathop{\mathrm{\mathbb E}}_{\gamma(\eta)} \left[ e^{ \eta R_T^\rho - \eta^2 V_T^\rho } \right] ~\le~ 1$ from which it follows that for any $$\eta$$ with mass $$\gamma(\eta)$$ we have $\eta R_T^\rho - \eta^2 V_T^\rho ~\le~ \operatorname{KL}\left(\rho\middle\|\pi\right) - \ln \gamma(\eta)$ Plugging in $$\hat \eta = \frac{R_T^\rho}{2 V_T^\rho}$$, we obtain $R_T^\rho ~\le~ 2 \sqrt{ V_T^\rho \left( \operatorname{KL}\left(\rho\middle\|\pi\right) - \ln \gamma(\hat\eta) \right) } .$ In the case that $$\gamma$$ puts insufficient mass on $$\hat \eta$$ e.g. because it is a density, we need to bound the amount of mass close enough to $$\hat \eta$$. As this task is orthogonal to the question of relative entropy bounds, I refer back to our paper (Koolen and Van Erven 2015) where we show how to do this for two appropriate families of proper priors.

# Relative Entropy Bound for the Improper Prior

Let $$\gamma$$ be the improper prior with density $$\gamma(\textrm d\eta) = \frac{1}{\eta} \textrm d\eta$$. The motivation for this prior is that it gives good regret bounds while at the same time admitting closed-form integration of the weights, resulting in constant run time per expert per round. We start out with negative potential: $0 ~\ge~ \Phi_T ~=~ \mathop{\mathrm{\mathbb E}}_{\pi(k)} \bigg[ \int_0^{1/2} \frac{ e^{\eta R_T^k - \eta^2 V_T^k} - 1 }{ \eta } \textrm d\eta \bigg] .$ For any $$\epsilon \in [0,1/2]$$, we can split the integral $\Phi_T ~=~ \mathop{\mathrm{\mathbb E}}_{\pi(k)} \bigg[ \int_0^{\epsilon} \frac{ e^{\eta R_T^k - \eta^2 V_T^k} - 1 }{ \eta } \textrm d\eta \bigg] + \mathop{\mathrm{\mathbb E}}_{\pi(k)} \bigg[ \int_\epsilon^{1/2} \frac{ e^{\eta R_T^k - \eta^2 V_T^k} - 1 }{ \eta } \textrm d\eta \bigg]$ To bound the first term we use $$e^x \ge 1+x$$, $$R_T^k \ge -T$$ and $$V_T^k \le T$$ to find $\mathop{\mathrm{\mathbb E}}_{\pi(k)} \bigg[ \int_0^{\epsilon} \frac{ e^{\eta R_T^k - \eta^2 V_T^k} - 1 }{ \eta } \textrm d\eta \bigg] ~\ge~ \int_0^{\epsilon} - T - \eta T \,\textrm d\eta ~=~ - \left( \epsilon + \frac{\epsilon^2}{2}\right) T .$ To bound the second term, we use $$\eqref{eq:PAC}$$ to obtain \begin{align*} \mathop{\mathrm{\mathbb E}}_{\pi(k)} \bigg[ \int_\epsilon^{1/2} \frac{ e^{\eta R_T^k - \eta^2 V_T^k} - 1 }{ \eta } \textrm d\eta \bigg] &~=~ \mathop{\mathrm{\mathbb E}}_{\pi(k)} \bigg[ \int_\epsilon^{1/2} \frac{ e^{\eta R_T^k - \eta^2 V_T^k} }{ \eta } \textrm d\eta \bigg] - \mathop{\mathrm{\mathbb E}}_{\pi(k)} \bigg[ \int_\epsilon^{1/2} \frac{ 1 }{ \eta } \textrm d\eta \bigg] \\ &~\ge~ e^{- \operatorname{KL}(\rho\|\pi)} \int_\epsilon^{1/2} \frac{ e^{\eta R_T^\rho - \eta^2 V_T^\rho} }{ \eta } \textrm d\eta + \ln (2 \epsilon) . \end{align*} Plugging in the convenient choice $$\epsilon = \frac{1}{2(T+1)}$$ for which $$(\epsilon + \epsilon^2/2)T \le 1/2$$, we now arrive at $e^{- \operatorname{KL}(\rho\|\pi)} \int_{\frac{1}{2(T+1)}}^{1/2} \frac{ e^{\eta R_T^\rho - \eta^2 V_T^\rho} }{ \eta } \textrm d\eta ~\le~ \frac{1}{2} + \ln (T+1) .$ To bound the integral, we need a case analysis, based on $$R = R_T^\rho$$ and $$V = V_T^\rho$$. There are three cases.

1. First the important case. If $$\hat \eta = \frac{R}{2 V}$$ satisfies $$\hat \eta - \frac{1}{\sqrt{2 V}} \ge \frac{1}{2(T+1)}$$ and $$\hat \eta \le 1/2$$, we find \begin{align*} \int_{\frac{1}{2(T+1)}}^{1/2} \frac{ e^{\eta R - \eta^2 V} }{ \eta } \textrm d\eta &~\ge~ \int_{\hat \eta - \frac{1}{\sqrt{2 V}}}^{\hat \eta} \frac{ e^{(\hat \eta - \frac{1}{\sqrt{2 V}}) R - (\hat \eta - \frac{1}{\sqrt{2 V}})^2 V} }{ \eta } \textrm d\eta ~=~ e^{\frac{R^2}{4 V} - \frac{1}{2}} %e^{(\hat \eta - \frac{1}{\sqrt{2 V}}) R - (\hat \eta - \frac{1}{\sqrt{2 V}})^2 V} \ln \frac{1}{1 - \frac{\sqrt{2 V}}{R}} \\ &~\ge~ e^{\frac{1}{2} \left(\frac{R}{\sqrt{2 V}} - 1\right)^2} e^{\frac{R}{\sqrt{2 V}}-1} \frac{\sqrt{2 V}}{R} ~\ge~ e^{\frac{1}{2} \left(\frac{R}{\sqrt{2 V}} - 1\right)^2} \end{align*} In this case, $e^{- \operatorname{KL}\left(\rho\middle\|\pi\right) + \frac{1}{2} \left(\frac{R}{\sqrt{2 V}} - 1\right)^2} ~\le~ \frac{1}{2} + \ln (T+1)$ from which it follows that $R ~\le~ \sqrt{2 V} \left( 1 + \sqrt{2 \left( \operatorname{KL}\left(\rho\middle\|\pi\right) + \ln \left( \frac{1}{2} + \ln (T+1) \right) \right) } \right) .$

2. Second, if $$\hat \eta - \frac{1}{\sqrt{2 V}} < \frac{1}{2(T+1)}$$ we conclude $R ~<~ \frac{V}{T+1} + \sqrt{2 V} ~<~ 1 + \sqrt{2 V} .$

3. Finally, if $$\hat \eta > 1/2$$ we find $$R > V$$, and hence for any $$u \in [\frac{1}{2(T+1)}, 1/2]$$, \begin{align*} \int_{\frac{1}{2(T+1)}}^{1/2} \frac{ e^{\eta R - \eta^2 V} }{ \eta } \textrm d\eta &~\ge~ \int_{u}^{1/2} \frac{ e^{u R - u^2 V} }{ \eta } \textrm d\eta ~=~ e^{u R - u^2 V} \ln \frac{1}{2 u} ~\ge~ e^{u (1-u)R} \ln \frac{1}{2 u} \end{align*} so that $R ~\le~ \frac{1}{u (1-u)} \left( \operatorname{KL}(\rho\|\pi) + \ln \frac{ \frac{1}{2} + \ln (T+1) }{ - \ln (2 u) } \right) .$ Plugging in $$u = \frac{5 - \sqrt{5}}{10}$$ (which is sufficiently large for all $$T \ge 1$$) we get $R ~\le~ 5 \big( \operatorname{KL}(\rho\|\pi) + \ln \left( 1 + 2\ln (T+1) \right) \big)$

Putting all three cases together, we find $\begin{gathered} R_T^\rho ~\le~ \sqrt{2 V_T^\rho} \left( 1 + \sqrt{2 \left( \operatorname{KL}\left(\rho\middle\|\pi\right) + \ln \left( \frac{1}{2} + \ln (T+1) \right) \right) } \right) \\ + 1 + 5 \big( \operatorname{KL}(\rho\|\pi) + \ln \left( 1 + 2\ln (T+1) \right) \big),\end{gathered}$ which is the relative entropy upgrade of (Koolen and Van Erven 2015, Theorem 4).

Koolen, Wouter M., and Tim van Erven. 2015. “Second-Order Quantile Methods for Experts and Combinatorial Games.” In Proceedings of the 28th Annual Conference on Learning Theory (COLT), edited by Peter Grünwald, Elad Hazan, and Satyen Kale, 1155–75. http://jmlr.org/proceedings/papers/v40/Koolen15a.pdf.
Luo, Haipeng, and Robert E. Schapire. 2015. “Achieving All with No Parameters: Adaptive NormalHedge.” In Proceedings of the 28th Annual Conference on Learning Theory (COLT), edited by Peter Grünwald, Elad Hazan, and Satyen Kale, 1286–1304. http://jmlr.org/proceedings/papers/v40/Luo15.pdf.