We look at one of the simplest online learning problems: making decisions in the presence of a perfect expert. We derive a non-uniform regret bound.

A classic problem in online learning is called *prediction with expert advice*. The idea is that we have a bunch of predictors called *experts*, and we want to design a sequential prediction algorithm that performs almost as well as the best among these experts, without knowing in advance which expert will be the best. We look at a very simple instance of this problem. First, we abstract away the predictions of these experts, and simply considers them as strategies that can be followed, and that incur loss. This is sometimes called *decision-theoretic online learning*. Second, we promise the learner that there is one perfect expert that never incurs any loss. This is known as the *noise-free* or *realisable* case. We study what can be guaranteed in this setting.

Every round \(t\) the algorithm plays a distribution \({\boldsymbol w}_t = (w_t^1, w_t^2, \ldots)\) on a (finite or countably infinite) sequence of experts. Then the losses \({\boldsymbol {\ell}}_t = ({\ell}_t^1, {\ell}_t^2, \ldots)\) of the experts are revealed, and the algorithm suffers loss \({\boldsymbol w}^{\intercal}{\boldsymbol {\ell}}= \sum_{i=1}^\infty w_t^i {\ell}_t^i\). We assume that \({\ell}_t^i \in [0,1]\). We are promised that there is (at least) one perfect expert.

We consider the following natural algorithm, parametrised by a prior probability distribution \(\pi\) on experts: in round \(t\) play \(w_t^i = \pi {(i|\textsf{live}_t)}\), where \(\textsf{live}_t\) is the set of experts that have not suffered any loss before time \(t\). What is this algorithmâ€™s worst-case loss?

In the worst-case a single expert is eliminated each round, and only a single expert, say \(j\), remains alive. Moreover, the worst-case order is to eliminate the experts in order from large to small prior, skipping over the best expert \(j\). Let us assume that the experts are indexed so that the prior \(\pi\) is decreasing. The loss then equals \begin{equation}\label{eq:loss} \sum_{i=1}^{j-1} \frac{\pi(i)}{\pi(\ge i)} + \sum_{i=j+1}^\infty \frac{\pi(i)}{\pi(j) + \pi(\ge i)} \end{equation} We now bound each term in the first sum by \[\frac{\pi(i)}{\pi(\ge i)} ~\le~ - \ln {\left( 1 - \frac{\pi(i)}{\pi(\ge i)} \right)} ~=~ - \ln \frac{- \pi(i) + \pi(\ge i)}{\pi(\ge i)} ~=~ \ln \frac{\pi(\ge i)}{\pi(\ge i+1)} ,\] and upon plugging this bound into the first sum in \(\eqref{eq:loss}\), it telescopes to \[\sum_{i=1}^{j-1} \frac{\pi(i)}{\pi(\ge i)} ~\le~ \sum_{i=1}^{j-1} \ln \frac{\pi(\ge i)}{\pi(\ge i+1)} ~=~ - \ln \pi(\ge j) .\] Similarly for the terms in the second sum we have \[\frac{\pi(i)}{\pi(j) + \pi(\ge i)} ~\le~ - \ln {\left( 1-\frac{\pi(i)}{\pi(j) + \pi(\ge i)} \right)} ~=~ \ln {\left( \frac{\pi(j) + \pi(\ge i)}{\pi(j) + \pi(\ge i+1)} \right)} ,\] so that now the second sum of \(\eqref{eq:loss}\) is bounded by the telescoping expression: \[\sum_{i=j+1}^\infty \frac{\pi(i)}{\pi(j) + \pi(\ge i)} ~\le~ \sum_{i=j+1}^\infty \ln {\left( \frac{\pi(j) + \pi(\ge i)}{\pi(j) + \pi(\ge i+1)} \right)} ~=~ \ln {\left( \frac{\pi(\ge j)}{\pi(j)} \right)} .\] All in all, we obtain that the worst-case loss \(\eqref{eq:loss}\) is bounded by \[- \ln \pi(j) .\]

I find it surprising that this is exactly the regret bound of the logarithmic loss (or in fact any exp-concave or mixable loss with rate \(1\)) game *without the perfect expert promise*.