In this post we have a look at tuning the learning \(\eta\) for one form of easy (namely iid Bernoulli) data. This post grew out of a discussion with Tom Sterkenburg at CWI.

We are considering the following learning scenario. The learner is sequentially predicting a sequence of zeros and ones. Each round, the learner predicts by saying either \(0\) or \(1\), the next outcome is revealed, and the learner incurs one unit of loss if he got it wrong. We allow the learner to randomise.

Our goal is to compete with a class of strategies. For us, a strategy \(\sigma\) maps a history \(x \in \{0,1\}^*\) to a prediction \(\sigma(x) \in \{0,1\}\). We write \(|x_t - \sigma(x_{<t})|\) for the loss of strategy \(\sigma\) on outcome \(x_t\) in round \(t\) after observing \(x_{<t}\) first.

The Hedge approach for aggregating strategies starts by putting a prior \(\pi\) on strategies and fixing a learning rate \(\eta\). Then after observing \(x_{\le t}\), one computes the *Hedge posterior* weights on strategies \begin{equation}\label{eq:posterior}
\pi(\sigma|x_{\le t})
~=~
\frac{
\pi(\sigma) e^{- \eta \sum_{s=1}^{t} |x_s - \sigma(x_{<s})|}
}{
\sum_\sigma
\pi(\sigma) e^{- \eta \sum_{s=1}^{t} |x_s - \sigma(x_{<s})|}
}
,
\end{equation} and finally one predicts by a sampling from \(\pi(\cdot|x_{\le t})\). When the outcome \(x_{t+1}\) is revealed, the expected loss is hence \[\sum_\sigma \pi(\sigma|x_{\le t}) |x_{t+1} - \sigma(x_{\le t})|
.\]

The question we are looking at here is *will the Hedge posterior weights concentrate?*

We assume that we are working with a countable set of strategies that includes all computable predictors, and where the prior is given by \(\pi(\sigma) = 2^{- K(\sigma)}\), where \(K(\sigma)\) is the Kolmogorov complexity of \(\sigma\), i.e. the length in bits of the shortest computer program that implements \(\sigma\). We will not need a high degree of detail about how this works exactly. Let us just say that classes of strategies like this are common in algorithmic information theory (Solomonoff induction).

Now assume that data are generated IID from a Bernoulli \(\theta\) source with \(\theta \ge 1/2\) (to break symmetry). We focus on the following specific strategies from our class.

The Bayes act. For IID Bernoulli \(\theta\) data with \(\theta \ge 1/2\), the prediction that minimises the expected loss is identically \(1\). This prediction is implemented by a short program, say with length \(c\) bits. Even though this predictor is optimal in expectation, it will make a mistake with probability \(1-\theta\) every round. After \(t\) rounds, we hence expect roughly \(t (1-\theta)\) mistakes. The contribution of this strategy to the numerator of \(\eqref{eq:posterior}\) is hence \begin{equation}\label{eq:bayes.act} 2^{-c} e^{- \eta t(1-\theta)} \end{equation}

Pure over-fitters. For every sequence \(x_{1:t}\) of every length \(t\) there is a program that predicts it perfectly (for example using a literal encoding of this sequence in the program text). Since \(x_{1:t}\) is IID Bernoulli, we know that the shortest perfect program for it is roughly of length \(K(x_{1:t}) ~\approx~ t H(\theta)\), where \(H(\theta)\) is the Shannon entropy (in bits) of Bernoulli \(\theta\). So its contribution to the numerator of \(\eqref{eq:posterior}\) is hence \begin{equation}\label{eq:overfit} 2^{- t H(\theta)} e^{- \eta \cdot 0} \end{equation}

Comparing \(\eqref{eq:bayes.act}\) and \(\eqref{eq:overfit}\) we can already get some interesting conclusion. What we hope to happen is that the posterior weight concentrates on the shortest Bayes act program. (This is not quite what will happen as there are many longer programs also implementing the Bayes act or a finite perturbation of it, so we can only hope for convergence in prediction.) But what we definitely do not want is that the Hedge posterior gives substantial weight to pure over-fitters. For this problematic case to not occur, we need that \(\eqref{eq:overfit}\) vanishes compared to \(\eqref{eq:bayes.act}\). That is, we need \[2^{-c} e^{- \eta t(1-\theta)} \gg 2^{- t H(\theta)}\] which happens for large \(t\) if \[\eta < \frac{ H(\theta) }{ (1-\theta) \log_2(e) } .\] This is a necessary condition. It may not be sufficient, there could be other reasons why we need an even lower \(\eta\). (For example when \(\theta = \frac{1}{2}\) we expect that no \(\eta>0\) works.

Here is a plot of the above upper bound.

For IID data and a finite set of strategies we are either in a worst-case scenario (two disagreeing strategies with the same mean loss) where the regret is \(O(\sqrt{t})\), or we are in a best-case scenario where any fixed learning rate gives constant regret. The simple observation in this post shows that for larger classes the picture is a great deal more subtle.