2021-01-20

Consider random variables \(X_1, X_2, \ldots\) that are mean-zero and satisfy \(X_i \in [\pm 1]\). If we aim for simplicity we may assume that the \(X_i\) are i.i.d., while if we aim for generality we may assume they are conditionally zero-mean and bounded. Let us abbreviate the sum by \(S_t = \sum_{s=1}^t X_s\) and the (one-round regularised) sum of squares by \(V_t = 1 + \sum_{s=1}^t X_s^2\). Now fix \(\alpha \ge 0\) and consider the process \begin{equation}\label{eq:freegrad.mart} M_t ~=~ \exp \left( \frac{S_t^2}{2(V_t + \alpha |S_t|)} - \frac{1}{2} \ln V_t \right) . \end{equation} In this post we ask the question: for which \(\alpha \ge 0\) is \(M_t\) a super-martingale? In the FreeGrad paper (Mhammedi and Koolen 2020), Zakaria Mhammedi and I show that \(\alpha \ge 1\) is sufficient. But is it the best choice? In this post we find a slightly tighter \(\alpha\). Namely, let \begin{equation}\label{eq:boundary} \alpha^* ~=~ \frac{1}{12} \left(4+\sqrt{78} \sin \left(\frac{1}{3} \arctan\left(\frac{\sqrt{\frac{19}{2}}}{33}\right)\right)+\sqrt{26} \cos \left(\frac{1}{3} \arctan\left(\frac{\sqrt{\frac{19}{2}}}{33}\right)\right)\right) \approx 0.780891 \end{equation} be the rightmost root of the cubic \(64 - 45 x - 216 x^2 + 216 x^3\). We will argue that any \(\alpha \ge \alpha^*\) is sufficient.
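As a quick numerical sanity check (a sketch of my own, not part of the argument), we can evaluate the closed form for \(\alpha^*\) in Python and confirm that it is indeed a root of the cubic:

```python
import math

# Evaluate the closed form for alpha* from the post and check that it is
# (numerically) a root of the cubic 64 - 45 x - 216 x^2 + 216 x^3.
theta = math.atan(math.sqrt(19 / 2) / 33) / 3
alpha_star = (4 + math.sqrt(78) * math.sin(theta)
                + math.sqrt(26) * math.cos(theta)) / 12

def cubic(x):
    return 64 - 45 * x - 216 * x ** 2 + 216 * x ** 3

print(alpha_star)                     # approximately 0.780891
print(abs(cubic(alpha_star)) < 1e-9)  # True
```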

First let’s consider the starting value. Plugging in \(S_0 = 0\) and \(V_0 = 1\), we indeed find \(M_0 = 1\). Then, to reason about the increments, let us abbreviate \[M(S,V) ~=~ \exp \left( \frac{S^2}{2(V + \alpha |S|)} - \frac{1}{2} \ln V \right)\] We need to show that for every distribution on \(X \in [\pm 1]\) with zero mean, we have \[\operatorname*{\mathbb E}_X\left[M(S+X,V+X^2)\right] ~\le~ M(S,V)\] This is equivalent to showing that \[\min_{\lambda \in \mathbb R} \max_{x \in [\pm 1]} M(S+x,V+x^2) + \lambda x ~\le~ M(S,V)\] Considering small \(x\), we see that the only possible choice is \(\lambda = -\frac{\partial}{\partial x} M(S+x,V+x^2)|_{x=0}\). Let us write \(M'(S,V)\) for that derivative, and observe that \[M'(S,V) ~=~ M(S,V) \frac{ S (2 V + \alpha |S|) }{ 2 (V + \alpha |S|)^2 } .\] It hence remains to show that this candidate in fact works. That is, we need to show that for all \(x \in [\pm 1]\) \[M(S+x,V+x^2) - M(S,V) \frac{ S (2 V + \alpha |S|) }{ 2 (V + \alpha |S|)^2 } x ~\le~ M(S,V)\] Dividing by \(M(S,V)\), taking logarithms and reorganising, we need to show for all \(x \in [\pm 1]\) that \[\frac{(S+x)^2}{V+x^2 + \alpha |S+x|} - \frac{S^2}{V + \alpha |S|} + \ln \frac{V}{V+x^2} - 2 \ln \left( 1 + x \frac{ S (2 V + \alpha |S|) }{ 2 (V + \alpha |S|)^2 } \right) ~\le~ 0\] Our approach will be to show that the left-hand side is increasing in \(V\), and that it tends to \(0\) in the limit \(V \to \infty\). The limit is immediate from the display above, as every term vanishes when \(V \to \infty\). To see that the left-hand side is increasing, we compute its derivative with respect to \(V\), which is \[- \frac{(S+x)^2}{\left(V+x^2 + \alpha |S+x|\right)^2} + \frac{S^2}{(V + \alpha |S|)^2} + \frac{1}{V} - \frac{1}{V+x^2} + \frac{2 x S V}{ (V + \alpha |S|)^3 \left( 1+ x \frac{S (2 V +\alpha |S|)}{2(V + \alpha |S|)^2} \right) } .\] We can reorganise this derivative as a ratio of a degree 5 polynomial and a degree 8 polynomial in \(V\). 
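Since the worst zero-mean distribution on \([\pm 1]\) may be taken to be supported on two points, the increment condition can be spot-checked numerically. The following sketch (my own, with an arbitrary grid of two-point distributions and states) verifies the inequality at \(\alpha = \alpha^*\):

```python
import math

ALPHA = 0.780891  # alpha* from the post (rounded up slightly)

def M(S, V, alpha=ALPHA):
    return math.exp(S * S / (2 * (V + alpha * abs(S))) - 0.5 * math.log(V))

# Check E[M(S + X, V + X^2)] <= M(S, V) for zero-mean two-point
# distributions supported on {a, b} with a > 0 > b.
worst_gap = -float("inf")
for a in [0.1, 0.3, 0.5, 1.0]:
    for b in [-0.1, -0.3, -0.5, -1.0]:
        p = -b / (a - b)  # P(X = a), chosen so that E[X] = 0
        for S in [-20.0, -3.0, -0.5, 0.0, 0.5, 3.0, 20.0]:
            for V in [1.0, 2.0, 10.0, 100.0]:
                lhs = p * M(S + a, V + a * a) + (1 - p) * M(S + b, V + b * b)
                worst_gap = max(worst_gap, lhs - M(S, V))
print(worst_gap <= 0)  # True: no violation on this grid
```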
Positivity of the denominator reduces to \[x S(2 V + \alpha |S|) + 2 \left(V + \alpha |S|\right)^2 ~\ge~ 0\] which holds for all \(\alpha \ge \frac{1}{2}\). Positivity of the numerator is more tricky. The coefficient on the most significant power (i.e. on \(V^5\)) is \[\alpha \left(4 \left| x+S\right| ^3-4 S \left| S\right| (3 x+S)\right)+2 x^3 (x+4 S)\] and hence positivity requires at least \[\alpha ~\ge~ \frac{ x^3 (x+4 S) }{ 2 S \left| S\right| (3 x+S) - 2 \left| x+S\right| ^3 }\] Maximising this over \(x \in [\pm 1]\) and \(S \in \mathbb R\) gives the claimed \(\alpha^*\) in \(\eqref{eq:boundary}\). We can then moreover check with symbolic algebra assistance that all other coefficients of the numerator are also positive for every \(\alpha \ge \alpha^*\). (The resort to proof by computer is not satisfying, but I am currently in this somewhat frustrating state where I can ask Mathematica to prove a polynomial inequality for me that I cannot easily prove directly myself.) Having all of its coefficients positive means the numerator polynomial is positive for all \(V > 0\), meaning that the gap is indeed increasing in \(V\) and hence maximised in the limit \(V \to \infty\).
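The maximisation can also be reproduced numerically. Here is a rough grid search of my own (a bounded grid stands in for \(S \in \mathbb R\); this is an illustration, not the exact maximisation):

```python
# Grid search for the supremum of x^3 (x + 4 S) / (2 S |S| (3 x + S) - 2 |x + S|^3)
# over x in [-1, 1] and S, restricted to where the coefficient of alpha in the
# V^5 term is positive (equivalently, where this denominator is negative).
best = -float("inf")
for i in range(401):
    x = -1 + i / 200          # x in [-1, 1], step 0.005
    for j in range(4001):
        S = -10 + j / 200     # bounded stand-in for S in R, step 0.005
        num = x ** 3 * (x + 4 * S)
        den = 2 * S * abs(S) * (3 * x + S) - 2 * abs(x + S) ** 3
        if den < -1e-9:
            best = max(best, num / den)
print(best)  # approximately 0.78089, approaching alpha* as the grid refines
```

On this grid the maximum is attained near \(x = -1\), \(S \approx 0.73\).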

The FreeGrad super-martingale is an interesting tool to have in the toolbox. Let me explain why. One often considers elementary martingales of exponential form, i.e. \(\prod_{s=1}^t e^{\eta X_s - c \eta^2 X_s^2}\) for some carefully picked \(c > 1/2\). By mixing such martingales over \(\eta\), we may hope to achieve something close to the maximum over \(\eta\), which would be \[\max_\eta \prod_{s=1}^t e^{\eta X_s - c \eta^2 X_s^2} ~=~ e^{\frac{S_t^2}{4 c V_t}} \qquad \text{with optimiser} \qquad \hat \eta ~=~ \frac{S_t}{2 c V_t} .\] We see that the FreeGrad super-martingale \(\eqref{eq:freegrad.mart}\) gets very close to this ideal. Surprisingly, FreeGrad is in some sense even better, as it behaves as if \(c=1/2\). The costs for this are the additive term \(\alpha |S_t|\) in the denominator (which is typically of lower order, since \(V_t \sim t\) while \(S_t \sim \pm \sqrt{t}\)), as well as the term \(\frac{1}{2} \ln V_t\), which we may think of as the cost of not knowing \(\hat \eta\) up front (and hence having to learn it from data). We find the exact same learning overhead in mixture-martingale approaches with a Gaussian prior on \(\eta\).
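The closed form for the maximum is plain calculus: the exponent \(\eta S_t - c \eta^2 V_t\) is a concave quadratic in \(\eta\). A quick numerical confirmation (example values are arbitrary):

```python
# The log of the exponential martingale is eta*S - c*eta^2*V (writing V for
# the sum of squares, as in the post); this concave quadratic in eta is
# maximised at eta_hat = S / (2 c V) with value S^2 / (4 c V).
c, S, V = 0.6, 7.0, 50.0   # arbitrary example values
eta_hat = S / (2 * c * V)
f = lambda eta: eta * S - c * eta ** 2 * V
print(abs(f(eta_hat) - S ** 2 / (4 * c * V)) < 1e-12)         # True
print(f(eta_hat) >= max(f(eta_hat - 0.01), f(eta_hat + 0.01)))  # True
```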

My thanks to Aaditya Ramdas for pointing out I overlooked the \(+1\) in the definition of \(V_t\).

Mhammedi, Zakaria, and Wouter M. Koolen. 2020. “Lipschitz and Comparator-Norm Adaptivity in Online Learning.” In *Proceedings of the 33rd Annual Conference on Learning Theory (COLT)*, 2858–87. http://proceedings.mlr.press/v125/mhammedi20a/mhammedi20a.pdf.