The Hedge Algorithm in Continuous Time

Wouter M. Koolen



We look at the Hedge algorithm in a version of continuous time.


The Hedge algorithm (see (Freund and Schapire 1997)) is one of the fundamental online learning algorithms. Hedge maintains weights on experts, and updates these weights in response to these experts incurring losses. Here we look at a version in “smooth” time. Instead of handing out the losses instantaneously, we consider what happens if we reveal them gradually.

Let’s review the definition of Hedge. Consider \(K\) experts. We denote by \(L_t^k\) the cumulative loss after \(t\) rounds of expert \(k\). In round \(t\) (after \(t-1\) rounds) the Hedge algorithm with learning rate \(\eta\) plays the probability vector \(\boldsymbol{w}_t\) with coordinates \[w_t^k ~=~ \frac{ e^{- \eta L_{t-1}^k} }{ \sum_{k=1}^K e^{- \eta L_{t-1}^k} } .\] Then the adversary reveals the loss vector \(\boldsymbol{\ell}_t \in [0,1]^K\) of round \(t\) and the algorithm incurs the dot loss \[\boldsymbol{\ell}_t^\intercal\boldsymbol{w}_t ~=~ \frac{ \sum_{k=1}^K \ell_t^k e^{- \eta L_{t-1}^k} }{ \sum_{k=1}^K e^{- \eta L_{t-1}^k} } .\] Now if we instead hand out the loss in \(n\) small identical pieces, with the algorithm updating in response to each piece, the loss is \[\sum_{i=1}^n \frac{ \sum_{k=1}^K \frac{\ell_t^k}{n} e^{- \eta (L_{t-1}^k + \frac{i-1}{n} \ell_t^k)} }{ \sum_{k=1}^K e^{- \eta (L_{t-1}^k + \frac{i-1}{n} \ell_t^k)} }\] Taking the limit of \(n \to \infty\), the sum above becomes the integral \[\int_0^1 \frac{ \sum_{k=1}^K \ell_t^k e^{- \eta (L_{t-1}^k + s \ell_t^k)} }{ \sum_{k=1}^K e^{- \eta (L_{t-1}^k + s \ell_t^k)} } \textrm ds .\] Observing that the numerator is, up to scaling, the derivative of the denominator, allows us to rewrite this as \[\left. - \frac{1}{\eta} \ln \Big(\sum_{k=1}^K e^{- \eta (L_{t-1}^k + s \ell_t^k)}\Big) \right|_{s=0}^1 ~=~ - \frac{1}{\eta} \ln \frac{\sum_{k=1}^K e^{- \eta L_t^k}}{\sum_{k=1}^K e^{- \eta L_{t-1}^k}} ~=~ - \frac{1}{\eta} \ln \left( \sum_{k=1}^K w_t^k e^{- \eta \ell_t^k} \right) ,\] which we recognise as the mix loss of Hedge. So in continuous time the difference between dot loss and mix loss evaporates.

Freund, Yoav, and Robert E. Schapire. 1997. “A Decision-Theoretic Generalization of on-Line Learning and an Application to Boosting.” Journal of Computer and System Sciences 55: 119–39.