2021-01-20

# Introduction

Consider random variables $$X_1, X_2, \ldots$$ that are mean-zero and satisfy $$X_i \in [\pm 1]$$. If we aim for simplicity we may assume that the $$X_i$$ are i.i.d., while if we aim for generality we will assume they are conditionally bounded and zero-mean. Let us abbreviate the sum by $$S_t = \sum_{s=1}^t X_s$$ and the sum of squares by $$V_t = \sum_{s=1}^t X_s^2$$. Now fix $$\alpha \ge 0$$ and consider the process $$\label{eq:freegrad.mart} M_t ~=~ \exp \left( \frac{S_t^2}{2(V_t + \alpha |S_t|)} - \frac{1}{2} \ln V_t \right) .$$ In this post we ask the question: for which $$\alpha \ge 0$$ is $$M_t$$ a martingale? In the FreeGrad paper (Mhammedi and Koolen 2020), Zakaria Mhammedi and I show that $$\alpha \ge 1$$ is sufficient. But is it the best choice? Here we find a slightly smaller sufficient $$\alpha$$. Namely, let $$\label{eq:boundary} \alpha^* ~=~ \frac{1}{12} \left(4+\sqrt{78} \sin \left(\frac{1}{3} \arctan\left(\frac{\sqrt{\frac{19}{2}}}{33}\right)\right)+\sqrt{26} \cos \left(\frac{1}{3} \arctan\left(\frac{\sqrt{\frac{19}{2}}}{33}\right)\right)\right) \approx 0.780891$$ be the rightmost root of the cubic $$216 x^3 - 216 x^2 - 45 x + 64$$. We will argue that $$\alpha^*$$ is sufficient.
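As a quick sanity check (my own sketch, not part of the argument below), one can evaluate the closed form $$\eqref{eq:boundary}$$ numerically and confirm that it is a root of the cubic:

```python
import math

# Evaluate the closed form for alpha* given above.
theta = math.atan(math.sqrt(19 / 2) / 33) / 3
alpha_star = (4 + math.sqrt(78) * math.sin(theta)
              + math.sqrt(26) * math.cos(theta)) / 12

print(alpha_star)  # approximately 0.780891
# It should be (numerically) a root of the cubic 216 x^3 - 216 x^2 - 45 x + 64.
print(216 * alpha_star**3 - 216 * alpha_star**2 - 45 * alpha_star + 64)
```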

## Argument

Let us abbreviate $M(S,V) ~=~ \exp \left( \frac{S^2}{2(V + \alpha |S|)} - \frac{1}{2} \ln V \right) .$ We need to show that for every distribution on $$X \in [\pm 1]$$ with zero mean, we have $\operatorname*{\mathbb E}_X\left[M(S+X,V+X^2)\right] ~\le~ M(S,V) .$ This is equivalent to showing that $\min_{\lambda \in \mathbb R} \max_{x \in [\pm 1]} M(S+x,V+x^2) + \lambda x ~\le~ M(S,V)$ (the term $$\lambda x$$ has zero expectation under any zero-mean distribution, and minimax duality gives the converse). Considering small $$x$$, we see that the only possible choice is $$\lambda = -\frac{\partial}{\partial x} M(S+x,V+x^2)\big|_{x=0}$$. Let us write $$M'(S,V)$$ for that derivative, so that $$\lambda = -M'(S,V)$$, and observe that $M'(S,V) ~=~ M(S,V) \frac{ S (2 V + \alpha |S|) }{ 2 (V + \alpha |S|)^2 } .$ It hence remains to show that this candidate in fact works. That is, we need to show for all $$x \in [\pm 1]$$ that $M(S+x,V+x^2) - M(S,V) \frac{ S (2 V + \alpha |S|) }{ 2 (V + \alpha |S|)^2 } x ~\le~ M(S,V) .$ Dividing by $$M(S,V)$$, taking logarithms and reorganising, we need to show for all $$x \in [\pm 1]$$ that $\frac{(S+x)^2}{V+x^2 + \alpha |S+x|} - \frac{S^2}{V + \alpha |S|} + \ln \frac{V}{V+x^2} - 2 \ln \left( 1 + x \frac{ S (2 V + \alpha |S|) }{ 2 (V + \alpha |S|)^2 } \right) ~\le~ 0 .$ Our approach will be to show that the left-hand side is increasing in $$V$$, and that it tends to $$0$$ in the limit $$V \to \infty$$. The limit can be read off from the display above, as each term vanishes. To see that the left-hand side is increasing, we check its derivative with respect to $$V$$, which is $- \frac{(S+x)^2}{\left(V+x^2 + \alpha |S+x|\right)^2} + \frac{S^2}{(V + \alpha |S|)^2} + \frac{1}{V} - \frac{1}{V+x^2} + \frac{2 x S V}{ (V + \alpha |S|)^3 \left( 1+ x \frac{S (2 V +\alpha |S|)}{2(V + \alpha |S|)^2} \right) } .$ We can reorganise this derivative as a ratio of a degree-5 polynomial and a degree-8 polynomial in $$V$$. Positivity of the denominator reduces to $x S(2 V + \alpha |S|) + 2 \left(V + \alpha |S|\right)^2 ~\ge~ 0 ,$ which holds whenever $$\alpha \ge \frac{1}{2}$$. Positivity of the numerator is trickier.
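Before diving into the numerator, one can spot-check the one-step inequality numerically. The sketch below (my own, not from the FreeGrad paper) takes $$\alpha = 0.781$$, slightly above $$\alpha^*$$, and checks $$\operatorname{\mathbb E}[M(S+X,V+X^2)] \le M(S,V)$$ over a grid of states and two-point zero-mean distributions:

```python
import math

def M(S, V, alpha):
    """FreeGrad-style process value at state (S, V); requires V > 0."""
    return math.exp(S * S / (2 * (V + alpha * abs(S))) - 0.5 * math.log(V))

alpha = 0.781  # slightly above alpha* ~ 0.780891

# Two-point zero-mean distributions: X = a w.p. p, X = b w.p. 1 - p,
# with p chosen so that p * a + (1 - p) * b = 0.
for S in [-3.0, -0.5, 0.0, 0.5, 3.0]:
    for V in [0.5, 1.0, 5.0, 50.0]:
        for a in [0.1, 0.5, 1.0]:
            for b in [-0.1, -0.5, -1.0]:
                p = -b / (a - b)  # zero-mean mixing weight
                lhs = (p * M(S + a, V + a * a, alpha)
                       + (1 - p) * M(S + b, V + b * b, alpha))
                assert lhs <= M(S, V, alpha) * (1 + 1e-9), (S, V, a, b)
print("supermartingale spot-check passed")
```

Two-point distributions suffice as a spot-check here because the worst case in the minimax problem above is attained at extreme points of the set of zero-mean distributions.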
The coefficient of the most significant power (i.e. of $$V^5$$) is $\alpha \left(4 \left| x+S\right| ^3-4 S \left| S\right| (3 x+S)\right)+2 x^3 (x+4 S) ,$ and hence positivity requires at least $\alpha ~\ge~ \frac{ x^3 (x+4 S) }{ 2 S \left| S\right| (3 x+S) - 2 \left| x+S\right| ^3 } .$ Maximising this over $$x \in [\pm 1]$$ and $$S \in \mathbb R$$ gives the claimed $$\alpha^*$$ in $$\eqref{eq:boundary}$$. We can then moreover check with symbolic algebra assistance that all other coefficients of the numerator are also positive from this $$\alpha$$ upward. (The resort to proof-by-computer is not satisfying, but I am currently in the somewhat frustrating state where I can ask Mathematica to prove a polynomial inequality for me that I cannot easily prove directly myself.) Having all of its coefficients positive means the numerator polynomial is positive for every $$V > 0$$, meaning that the gap is indeed increasing in $$V$$ and hence maximised in the limit $$V \to \infty$$.
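The maximisation over the increment and $$S$$ can also be approximated on a grid. A sketch (restricting $$S$$ to $$[-3,3]$$, which turns out to contain the maximiser; the constraint only binds where the coefficient of $$\alpha$$ is positive):

```python
import numpy as np

# Maximise the lower bound on alpha forced by the V^5 coefficient,
# over the increment x in [-1, 1] and a range of S.
xs = np.linspace(-1.0, 1.0, 401)
Ss = np.linspace(-3.0, 3.0, 1201)
X, S = np.meshgrid(xs, Ss)
A = 4 * np.abs(X + S) ** 3 - 4 * S * np.abs(S) * (3 * X + S)  # multiplies alpha
B = 2 * X ** 3 * (X + 4 * S)                                   # remaining term
with np.errstate(divide="ignore", invalid="ignore"):
    # positivity of alpha * A + B with A > 0 forces alpha >= -B / A
    bound = np.where(A > 0, -B / A, -np.inf)
alpha_needed = float(bound.max())
print(alpha_needed)  # close to 0.780891
```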

# Conclusion

The FreeGrad Martingale is an interesting tool to have in the toolbox. Let me explain why. One often considers elementary martingales of exponential form, i.e. $$\prod_{s=1}^t e^{\eta X_s - c \eta^2 X_s^2}$$ for some carefully picked $$c > 1/2$$. By mixing such martingales over $$\eta$$, we may hope to achieve something close to the maximum over $$\eta$$, which would be $\max_\eta \prod_{s=1}^t e^{\eta X_s - c \eta^2 X_s^2} ~=~ e^{\frac{S_t^2}{4 c V_t}} \qquad \text{with optimiser} \qquad \hat \eta ~=~ \frac{S_t}{2 c V_t} .$ We see that the FreeGrad Martingale $$\eqref{eq:freegrad.mart}$$ gets very close to this ideal. Surprisingly, FreeGrad is in some sense even better, as it behaves as if $$c=1/2$$. The costs for this are the additive term $$\alpha |S_t|$$ in the denominator of the exponent (which is typically of lower order, since $$V_t \sim n$$ while $$S_t \sim \pm \sqrt{n}$$), as well as the term $$\frac{1}{2} \ln V_t$$, which we may think of as the cost of not knowing $$\hat \eta$$ up front (and hence having to learn it from data). We find the exact same learning overhead in mixture martingale approaches with a Gaussian prior on $$\eta$$.
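To make that last point concrete, here is the standard mixture computation (a sketch, with $$\sigma^2 > 0$$ an arbitrary prior variance). Completing the square under a Gaussian prior $$\eta \sim \mathcal N(0, \sigma^2)$$ gives $$\int e^{\eta S_t - c \eta^2 V_t} \, \frac{e^{-\eta^2/(2\sigma^2)}}{\sqrt{2 \pi \sigma^2}} \, \mathrm{d}\eta ~=~ \frac{1}{\sqrt{1 + 2 c \sigma^2 V_t}} \exp\left( \frac{S_t^2 \sigma^2}{2 \left(1 + 2 c \sigma^2 V_t\right)} \right) ,$$ which in the typical regime $$S_t^2 = O(V_t)$$ behaves like $$\exp\left( \frac{S_t^2}{4 c V_t} - \frac{1}{2} \ln V_t + O(1) \right)$$: the same $$\frac{1}{2} \ln V_t$$ overhead.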

Mhammedi, Zakaria, and Wouter M. Koolen. 2020. “Lipschitz and Comparator-Norm Adaptivity in Online Learning.” In Proceedings of the 33rd Annual Conference on Learning Theory (COLT), 2858–87. http://proceedings.mlr.press/v125/mhammedi20a/mhammedi20a.pdf.