2018-06-11

This post, which is self-contained, came about as a reflection on my recent paper (Kaufmann, Koolen, and Garivier 2018), which builds on work by Russo (2016). In both, one central object of study is a certain update equation, which is similar to exponential weights but generates surprisingly different behaviour.

Fix \(K\) strictly positive numbers \(\boldsymbol{d}= (d_1, \ldots, d_K)\). Consider the function \(f : \triangle_K \to \triangle_K\) from the \(K\)-probability simplex to itself, defined by \begin{equation}\label{eq:map} f(\boldsymbol{w}) ~=~ \left(f_1(\boldsymbol{w}), \ldots, f_K(\boldsymbol{w})\right) \qquad \text{where} \qquad f_i(\boldsymbol{w}) ~=~ \frac{w_i e^{- w_i d_i}}{\sum_{j=1}^K w_j e^{- w_j d_j}} . \end{equation} The subtle difference from the exponential weights update is the presence of the weights \(w_i\) multiplying the coefficients \(d_i\) in the exponent. As an example, \(f\) maps the uniform weights to \[\left(\frac{1}{K}, \ldots, \frac{1}{K}\right) ~\stackrel{f}{\to}~ \left( \frac{e^{- d_1/K }}{\sum_{j=1}^K e^{- d_j/K }} , \ldots, \frac{e^{- d_K/K }}{\sum_{j=1}^K e^{- d_j/K }} \right) .\] In this post we study the behaviour of the iterates \[\boldsymbol{w}, f(\boldsymbol{w}), f(f(\boldsymbol{w})), \ldots\] In particular, we are interested in \(\lim_{n \to \infty} f^n(\boldsymbol{w})\), where \(f^n\) denotes the \(n\)-fold iterated application of \(f\), and in conditions under which these iterates converge. If they do converge, they converge to a fixed point. So let’s look at those first.
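As a quick numerical illustration (a minimal sketch; the \(\boldsymbol{d}\) values below are illustrative, not taken from the figure), we can implement the map \(\eqref{eq:map}\) and iterate it from the uniform starting point:

```python
import numpy as np

def f(w, d):
    """One application of the map: reweight each w_i by exp(-w_i * d_i), then normalise."""
    v = w * np.exp(-w * d)
    return v / v.sum()

# Illustrative example values (not from the post).
d = np.array([1.0, 2.0, 4.0])
w = np.full(3, 1 / 3)          # start at the uniform weights
for _ in range(1000):
    w = f(w, d)
print(w)
```

For these values the iterates settle at \((4/7, 2/7, 1/7)\), i.e. with \(w_i \propto 1/d_i\) rather than concentrating on the smallest \(d_i\).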

A fixed point \(\boldsymbol{w}\) of \(f\) satisfies \(\boldsymbol{w}= f(\boldsymbol{w})\), i.e. for all \(i \in [K]\) \[w_i ~=~ \frac{w_i e^{-w_i d_i}}{\sum_{j=1}^K w_j e^{-w_j d_j}}
.\] We see that for each \(i \in [K]\) either \(w_i = 0\) or \(e^{-w_i d_i} = \sum_{j=1}^K w_j e^{-w_j d_j}\). In the latter case the right-hand side is independent of \(i\), and hence \(w_i \propto 1/d_i\). So for each non-empty subset \(S \subseteq [K]\) of non-zero coordinates we obtain a fixed point \(\boldsymbol{w}^S\) given by \[w_i^S ~=~ \frac{\mathbf 1_{i \in S}/d_i}{\sum_{j=1}^K \mathbf 1_{j \in S}/d_j}.\] Hence there are \(2^K-1\) fixed points. We denote the *central* fixed point, having fully dense \(S = [K]\), by \(\boldsymbol{w}^* = \boldsymbol{w}^{[K]}\), i.e. \[w^*_i ~=~ \frac{1/d_i}{\sum_{j=1}^K 1/d_j}
.\]
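To sanity-check this enumeration, here is a small sketch (again with illustrative \(\boldsymbol{d}\) values) that constructs all \(2^K - 1\) candidate fixed points \(\boldsymbol{w}^S\) and verifies that \(f\) maps each one to itself:

```python
import numpy as np
from itertools import combinations

def f(w, d):
    v = w * np.exp(-w * d)
    return v / v.sum()

d = np.array([1.0, 2.0, 4.0])   # illustrative values
K = len(d)

fixed_points = []
for r in range(1, K + 1):
    for S in combinations(range(K), r):
        w = np.zeros(K)
        w[list(S)] = 1 / d[list(S)]   # w_i proportional to 1/d_i on S, zero off S
        w /= w.sum()
        assert np.allclose(f(w, d), w)
        fixed_points.append(w)

print(len(fixed_points))   # 2**K - 1 = 7 fixed points
```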

The figure below visualises the mapping \(\eqref{eq:map}\) as a function of the magnitude of \(\boldsymbol{d}\). We see that small \(\boldsymbol{d}\) leads to gentle dynamics, where all interior starting points eventually converge to the central fixed point \(\boldsymbol{w}^*\). For medium \(\boldsymbol{d}\) the convergence is faster. For large \(\boldsymbol{d}\) the iterates become unstable, and in the last row of the figure they do not converge at all.

Let’s investigate whether we can explain what we see happen in these plots. It turns out that we cannot simply apply the Banach fixed-point theorem, as \(f\) is not in general a contraction. Yet by looking at its behaviour near the central fixed point \(\boldsymbol{w}^*\), we can identify a phase transition showing that there are (at least) two kinds of \(\boldsymbol{d}\).

We analyse the behaviour of \(f\) at \(\boldsymbol{w}\) near the fixed point \(\boldsymbol{w}^*\). Write \(\boldsymbol{w}= \boldsymbol{w}^* + \epsilon \boldsymbol{z}\) where \(\sum_{i=1}^K z_i = 0\) and \(|\epsilon|\) is small. We use that \(d_i w^*_i = \frac{1}{\sum_{j=1}^K 1/d_j}\), a constant. Then approximating \(e^{\epsilon x}\) by \(1+\epsilon x\) and ignoring all terms of order \(\epsilon^2\), we find \begin{align*} f_i(\boldsymbol{w}) &~\propto~ (w^*_i + \epsilon z_i) e^{-(w^*_i + \epsilon z_i) d_i} \\ &~\propto~ (w^*_i + \epsilon z_i) e^{-\epsilon z_i d_i} \\ &~\approx~ (w^*_i + \epsilon z_i) (1-\epsilon z_i d_i) \\ &~\approx~ w^*_i + \epsilon z_i - \epsilon z_i \frac{1}{\sum_{j=1}^K 1/d_j} \\ &~=~ w^*_i + \left(1- \frac{1}{\sum_{j=1}^K 1/d_j}\right) \epsilon z_i \end{align*} Since \(\sum_{i=1}^K z_i = 0\), the right-hand sides already sum to one, so the normalising constant is \(1\) up to order \(\epsilon^2\) and the proportionality is in fact an equality to first order. Stated another way, this shows that \[f(\boldsymbol{w}) - \boldsymbol{w}^* ~\approx~ \left(1- \frac{1}{\sum_{j=1}^K 1/d_j}\right) (\boldsymbol{w}-\boldsymbol{w}^*) .\] From this we conclude that the distance to \(\boldsymbol{w}^*\) decreases geometrically with ratio \(\left|1- \frac{1}{\sum_{j=1}^K 1/d_j}\right|\), provided this ratio is below one. Now when is \(\left|1- \frac{1}{\sum_{j=1}^K 1/d_j}\right| < 1\)? Since \(\frac{1}{\sum_{j=1}^K 1/d_j} > 0\), this happens iff \(\frac{1}{\sum_{j=1}^K 1/d_j} < 2\), i.e. iff \[\frac{1}{2} ~<~ \sum_{j=1}^K \frac{1}{d_j} .\] Of course this analysis only holds near the central fixed point \(\boldsymbol{w}^*\). As this is the point we expect the iterates to converge to, it is the natural place to look. It could be that points far from \(\boldsymbol{w}^*\) derail earlier. But what we see in experiments is that such points behave chaotically until some iterate happens to land near \(\boldsymbol{w}^*\), after which the convergence indeed kicks in. The threshold above does match the derailment threshold observed in experiments.
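We can verify the predicted contraction factor numerically (a sketch with illustrative \(\boldsymbol{d}\) values): perturb \(\boldsymbol{w}^*\) slightly in a direction with zero sum, apply \(f\) once, and measure by how much the perturbation shrank or grew.

```python
import numpy as np

def f(w, d):
    v = w * np.exp(-w * d)
    return v / v.sum()

def local_rate(d, eps=1e-7):
    """Empirical factor by which one step of f scales a small perturbation of w*."""
    w_star = (1 / d) / np.sum(1 / d)
    z = np.zeros_like(d)
    z[0], z[1] = 1.0, -1.0               # perturbation direction with sum zero
    w = w_star + eps * z
    return np.linalg.norm(f(w, d) - w_star) / np.linalg.norm(eps * z)

for d in [np.array([3.0, 3.0]),          # sum of 1/d_j = 2/3 > 1/2: contracting
          np.array([5.0, 5.0])]:         # sum of 1/d_j = 2/5 < 1/2: expanding
    predicted = abs(1 - 1 / np.sum(1 / d))
    print(d, predicted, local_rate(d))
```

For \(\boldsymbol{d}= (3, 3)\) both numbers come out as \(0.5\) (the perturbation shrinks), while for \(\boldsymbol{d}= (5, 5)\) both are \(1.5\) (it grows), straddling the threshold \(\sum_j 1/d_j = 1/2\).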

As a researcher working in online learning, I am quite familiar with the *exponential weights update* \(w_i \mapsto \frac{w_i e^{-d_i}}{\sum_{j=1}^K w_j e^{-d_j}}\). Here, the iterates of the mapping converge to a point-mass on \(\arg\min_i d_i\) (assuming that \(\arg\min_i d_i\) is a singleton). The mapping \(f\) in \(\eqref{eq:map}\) is rather similar to the exponential weights update. Yet its iterated behaviour is completely different, as we exploited recently in (Kaufmann, Koolen, and Garivier 2018), building on (Russo 2016). It would be very interesting to study other variant update equations and their limiting behaviour.
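For contrast, here is a sketch of the plain exponential weights iteration (again with illustrative \(d\) values), whose iterates collapse onto the minimiser instead of spreading mass proportionally to \(1/d_i\):

```python
import numpy as np

def exp_weights(w, d):
    """Standard exponential weights update: no factor w_i inside the exponent."""
    v = w * np.exp(-d)
    return v / v.sum()

d = np.array([1.0, 2.0, 4.0])   # illustrative values; argmin is coordinate 0
w = np.full(3, 1 / 3)
for _ in range(200):
    w = exp_weights(w, d)
print(w)   # essentially the point-mass (1, 0, 0) on argmin_i d_i
```

After \(n\) steps the weights are proportional to \(e^{-n d_i}\), so all mass flows to the coordinate with the smallest \(d_i\).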

Kaufmann, Emilie, Wouter M. Koolen, and Aurélien Garivier. 2018. “Sequential Test for the Lowest Mean: From Thompson to Murphy Sampling.” In *NeurIPS 31*, edited by S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, 6333–43. Curran Associates, Inc. http://papers.nips.cc/paper/7870-sequential-test-for-the-lowest-mean-from-thompson-to-murphy-sampling.pdf.

Russo, Daniel. 2016. “Simple Bayesian Algorithms for Best Arm Identification.” *CoRR* abs/1602.08448. http://arxiv.org/abs/1602.08448.