We look at one of the simplest online learning problems: making decisions in the presence of a perfect expert. We derive a non-uniform regret bound.

Introduction

A classic problem in online learning is called prediction with expert advice. The idea is that we have a bunch of predictors called experts, and we want to design a sequential prediction algorithm that performs almost as well as the best among these experts, without knowing in advance which expert will be the best. We look at a very simple instance of this problem. First, we abstract away the predictions of these experts, and simply consider them as strategies that can be followed, and that incur loss. This is sometimes called decision-theoretic online learning. Second, we promise the learner that there is one perfect expert that never incurs any loss. This is known as the noise-free or realisable case. We study what can be guaranteed in this setting.

Setup

Every round $$t$$ the algorithm plays a distribution $${\boldsymbol w}_t = (w_t^1, w_t^2, \ldots)$$ on a (finite or countably infinite) sequence of experts. Then the losses $${\boldsymbol {\ell}}_t = ({\ell}_t^1, {\ell}_t^2, \ldots)$$ of the experts are revealed, and the algorithm suffers loss $${\boldsymbol w}_t^{\intercal}{\boldsymbol {\ell}}_t = \sum_{i=1}^\infty w_t^i {\ell}_t^i$$. We assume that $${\ell}_t^i \in [0,1]$$. We are promised that there is (at least) one perfect expert, i.e. an expert $$i$$ with $${\ell}_t^i = 0$$ for all $$t$$.

We consider the following natural algorithm, parametrised by a prior probability distribution $$\pi$$ on experts: in round $$t$$ play $$w_t^i = \pi {(i|\textsf{live}_t)}$$, where $$\textsf{live}_t$$ is the set of experts that have not suffered any loss before time $$t$$. What is this algorithm’s worst-case loss?
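The conditioning step is straightforward to implement. Here is a minimal sketch for finitely many experts and $$0/1$$ losses (the function names and the NumPy representation are my own, not from the text): an expert leaves the live set as soon as it suffers a positive loss, and the played weights are the prior restricted to the live set and renormalised.

```python
import numpy as np

def play_round(prior, alive):
    """Return w_t: the prior conditioned on the set of live experts."""
    w = np.where(alive, prior, 0.0)
    return w / w.sum()

def run(prior, losses):
    """Run the algorithm on a (rounds x experts) 0/1 loss matrix;
    return the total loss suffered."""
    alive = np.ones(len(prior), dtype=bool)
    total = 0.0
    for ell in losses:
        w = play_round(prior, alive)
        total += w @ ell
        alive &= (ell == 0)  # an expert dies once it suffers any loss
    return total
```

For instance, with a uniform prior on two experts and one round in which the first expert errs, the algorithm suffers loss $$1/2$$ and then follows the surviving expert for free.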

Analysis

In the worst case a single expert is eliminated each round, until only a single expert, say $$j$$, remains alive. Moreover, the worst-case order is to eliminate the experts from largest to smallest prior, skipping over the perfect expert $$j$$. Assume the experts are indexed so that the prior $$\pi$$ is decreasing. The loss then equals

$$\label{eq:loss} \sum_{i=1}^{j-1} \frac{\pi(i)}{\pi(\ge i)} + \sum_{i=j+1}^\infty \frac{\pi(i)}{\pi(j) + \pi(\ge i)} .$$

Using $$x \le - \ln(1-x)$$, we bound each term in the first sum by

$$\frac{\pi(i)}{\pi(\ge i)} ~\le~ - \ln {\left( 1 - \frac{\pi(i)}{\pi(\ge i)} \right)} ~=~ - \ln \frac{\pi(\ge i) - \pi(i)}{\pi(\ge i)} ~=~ \ln \frac{\pi(\ge i)}{\pi(\ge i+1)} ,$$

and upon plugging this bound into the first sum in $$\eqref{eq:loss}$$, it telescopes (using $$\pi(\ge 1) = 1$$) to

$$\sum_{i=1}^{j-1} \frac{\pi(i)}{\pi(\ge i)} ~\le~ \sum_{i=1}^{j-1} \ln \frac{\pi(\ge i)}{\pi(\ge i+1)} ~=~ \ln \frac{\pi(\ge 1)}{\pi(\ge j)} ~=~ - \ln \pi(\ge j) .$$

Similarly, for the terms in the second sum we have

$$\frac{\pi(i)}{\pi(j) + \pi(\ge i)} ~\le~ - \ln {\left( 1-\frac{\pi(i)}{\pi(j) + \pi(\ge i)} \right)} ~=~ \ln {\left( \frac{\pi(j) + \pi(\ge i)}{\pi(j) + \pi(\ge i+1)} \right)} ,$$

so that the second sum of $$\eqref{eq:loss}$$ also telescopes, using $$\pi(j) + \pi(\ge j+1) = \pi(\ge j)$$ and $$\pi(\ge i) \to 0$$:

$$\sum_{i=j+1}^\infty \frac{\pi(i)}{\pi(j) + \pi(\ge i)} ~\le~ \sum_{i=j+1}^\infty \ln {\left( \frac{\pi(j) + \pi(\ge i)}{\pi(j) + \pi(\ge i+1)} \right)} ~=~ \ln {\left( \frac{\pi(\ge j)}{\pi(j)} \right)} .$$

Adding the two bounds, we obtain that the worst-case loss $$\eqref{eq:loss}$$ is at most

$$- \ln \pi(\ge j) + \ln {\left( \frac{\pi(\ge j)}{\pi(j)} \right)} ~=~ - \ln \pi(j) .$$
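As a sanity check (an illustrative sketch, not from the text), we can evaluate the worst-case loss sum numerically for a geometric prior $$\pi(i) \propto 2^{-i}$$ truncated to $$n$$ experts, and verify that it stays below $$-\ln \pi(j)$$ for every choice of surviving expert $$j$$.

```python
import math

# Geometric prior pi(i) proportional to 2^{-i}, truncated to n experts
# and renormalised (indices are 0-based, decreasing in prior).
n = 20
pi = [2.0 ** -(i + 1) for i in range(n)]
total_mass = sum(pi)
pi = [p / total_mass for p in pi]

def tail(i):
    """pi(>= i) for the truncated prior."""
    return sum(pi[i:])

for j in range(n):
    # Worst case: experts 0..j-1 die first (largest priors),
    # then j+1..n-1; expert j survives.
    loss = sum(pi[i] / tail(i) for i in range(j))
    loss += sum(pi[i] / (pi[j] + tail(i)) for i in range(j + 1, n))
    assert loss <= -math.log(pi[j]) + 1e-9, (j, loss)
```

Every worst-case loss indeed stays below the bound $$-\ln \pi(j)$$, with the gap coming from the inequality $$x \le -\ln(1-x)$$.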

I find it surprising that this is exactly the regret bound for the logarithmic loss game (or in fact for any exp-concave or mixable loss with rate $$1$$) without the perfect-expert promise.