2016-11-10

Abstract

We look at extending two recent adaptive online learning strategies, iProd and Squint, to the matrix Hedge setting. The extension is straightforward up to a point, but we end with a nice open problem.

A great many algorithms for online learning maintain a vector parameter. Gradient Descent, Exponential Weights, Mirror Descent, etc. Many of these online learning algorithms have a ntural matrix counterpart. A random sample of papers is here (Tsuda, Rätsch, and Warmuth 2005), (Warmuth and Kuzmin 2006), (Warmuth and Kuzmin 2008), (Warmuth and Kuzmin 2010), (Koolen, Kotłowski, and Warmuth 2011), (Hazan, Kale, and Shalev-Shwartz 2012). A variety of settings are considered in the literature, differing in the specific matrix domains considered and in the choice of loss function. In this post we look at the Hedge setting and the matrix Hedge setting. The matrix Hedge setting is interesting from the perspective of applications (e.g. PCA, see (Warmuth and Kuzmin 2008)) but also from a mathematical point of view, to see which ideas survive and which new tricks are needed. In this post we upgrade the recent iProd and Squint algorithms by (Koolen and Van Erven 2015) to matrices. These algorithms have so-called second-order quantile regret bounds, and the goal will be to derive these for the matrix setting.

We start by reviewing the matrix Hedge setting. Learning proceeds in rounds \(t=1, 2, \ldots\). In round \(t\) the learner plays a density matrix \(\boldsymbol{W}_t\). A density matrix is a positive semi-definite matrix with unit trace. Density matrices generalise probability mass functions. Their eigenvalues form a probability distribution, but in addition they carry an orthogonal transformation in the eigenvectors. The adversary then reveals a symmetric loss matrix \(\boldsymbol{L}_t\). We assume that the eigenvalues of \(\boldsymbol{L}_t\) are in \([0,1]\). The loss is \(\mathop{\mathrm{tr}}(\boldsymbol{W}_t \boldsymbol{L}_t)\), i.e. the default dot product of matrices. We define the instantaneous regret matrix \(\boldsymbol{R}_t\) by \[\boldsymbol{R}_t ~:=~ \mathop{\mathrm{tr}}(\boldsymbol{W}_t \boldsymbol{L}_t) \boldsymbol{I}- \boldsymbol{L}_t .\] So \(\boldsymbol{R}_t\) is symmetric with eigenvalues falling in \([-1,1]\). The goal of the learner is to keep the cumulative regret matrix \(\sum_{t=1}^T \boldsymbol{R}_t\) as small as possible. The traditional regret bound for the Matrix Hedge algorithm by (Warmuth and Kuzmin 2006) in \(d\) dimensions is \[\sum_{t=1}^T \mathop{\mathrm{tr}}(\boldsymbol{W}_t \boldsymbol{L}_t) - \min_{\boldsymbol{W}} \sum_{t=1}^T \mathop{\mathrm{tr}}(\boldsymbol{W}\boldsymbol{L}_t) ~=~ \lambda_{\max} \left(\sum_{t=1}^T \boldsymbol{R}_t\right) ~\le~ \sqrt{T/2 \ln d} .\] This regret bound is equal to the regret bound for the standard (vector) Hedge setting. In this sense predicting with density matrices is not any harder than predicting with probability distributions. The central point of this post is to replace the time horizon \(T\) by a more sophisticated measure of the complexity of the data.

Next we review the matrix basics. Let \(\boldsymbol{A}\) be a real symmetric matrix. Let us eigendecompose \(\boldsymbol{A}= \sum_i \lambda_i \boldsymbol{u}_i \boldsymbol{u}_i^\intercal\). The trace of \(\boldsymbol{A}\) is the sum of the eigenvalues \[\mathop{\mathrm{tr}}(\boldsymbol{A}) ~=~ \sum_i \lambda_i.\] The square, exponential and logarithm of a matrix are the matrices obtained by applying each operation to the eigenvalues (aka spectrally) \begin{align*} \boldsymbol{A}^2 &~=~ \sum_i \lambda_i^2 \boldsymbol{u}_i \boldsymbol{u}_i^\intercal \\ \exp(\boldsymbol{A}) &~=~ \sum_i e^{\lambda_i} \boldsymbol{u}_i \boldsymbol{u}_i^\intercal , \\ \ln(\boldsymbol{A}) &~=~ \sum_i \ln(\lambda_i) \boldsymbol{u}_i \boldsymbol{u}_i^\intercal . \end{align*} The Golden-Thompson inequality states that for any real symmetric \(\boldsymbol{A}\) and \(\boldsymbol{B}\) \begin{equation}\label{eq:gt} \mathop{\mathrm{tr}}(\exp(\boldsymbol{A}+ \boldsymbol{B})) ~\le~ \mathop{\mathrm{tr}}(\exp(\boldsymbol{A})\exp(\boldsymbol{B})) . \end{equation} We say that \(\boldsymbol{A}\preceq \boldsymbol{B}\) if \(\boldsymbol{B}-\boldsymbol{A}\) is positive semi-definite. In particular, \begin{equation}\label{eq:prod} \exp(\eta \boldsymbol{R}- \eta^2 \boldsymbol{R}^2) ~\preceq~ \boldsymbol{I}+ \eta \boldsymbol{R} \end{equation} for \(\eta \in [0,1/2]\) and any real symmetric \(\boldsymbol{R}\) with eigenvalues in \([-1,1]\). The Von Neumann relative entropy between density matrices is given by \[\triangle \left(\boldsymbol{W}\middle\|\boldsymbol{U}\right) ~=~ \mathop{\mathrm{tr}}\left(\boldsymbol{W}\left(\ln \boldsymbol{W}- \ln \boldsymbol{U}\right)\right) .\] As for the Shannon entropy, \(\triangle \left(\boldsymbol{W}\middle\|\boldsymbol{U}\right) \in [0,\ln d]\). We have the familiar duality result (see e.g. this post) \begin{equation}\label{eq:duality} \inf_{\boldsymbol{W}}~ \triangle \left(\boldsymbol{W}\middle\|\boldsymbol{U}\right) + \eta \mathop{\mathrm{tr}}\left(\boldsymbol{W}\boldsymbol{L}\right) ~=~ - \ln \mathop{\mathrm{tr}}\left(\exp(\ln \boldsymbol{U}-\eta \boldsymbol{L})\right) \end{equation} where \(\boldsymbol{W}\) ranges over all density matrices.

In this section we look at the upgrade of iProd to the matrix setting. Following (Koolen and Van Erven 2015), we will make use of a prior distribution \(\gamma(\eta)\) on learning rates \(\eta \in [0,1/2]\). The *Matrix iProd* potential and its associated prediction are given by \begin{equation}\label{eq:MiProd.w}
\Phi_T
~=~
\mathop{\mathrm{\mathbb E}}_{\gamma(\eta)} \left[
\mathop{\mathrm{tr}}\left( \boldsymbol{F}_T^\eta\right)
\right]
\qquad
\text{and}
\qquad
\boldsymbol{W}_{T+1}
~=~
\frac{
\mathop{\mathrm{\mathbb E}}_{\gamma(\eta)} \left[
\boldsymbol{F}_T^\eta
\eta
\right]
}{
\mathop{\mathrm{tr}}\left(
\mathop{\mathrm{\mathbb E}}_{\gamma(\eta)} \left[
\boldsymbol{F}_T^\eta
\eta
\right]
\right)
}
,
\end{equation} where \[\boldsymbol{F}_T^\eta
~=~
\exp \left(\sum_{t=1}^T \ln \left(\boldsymbol{I}+ \eta \boldsymbol{R}_t\right)\right)
~=~
\bigodot_{t=1}^T \left(\boldsymbol{I}+ \eta \boldsymbol{R}_t\right)
.\] The succinct right-hand notation uses the *commutative matrix product* \(\odot\), defined for positive semi-definite matrices \(\boldsymbol{A}\) and \(\boldsymbol{B}\) by \(\boldsymbol{A}\odot \boldsymbol{B}= \exp\big(\ln(\boldsymbol{A}) + \ln(\boldsymbol{B})\big)\). Let us now show that Matrix iProd has a second-order relative-entropy regret bound.

The argument that Matrix iProd keeps the regret small has two parts. First we show that the weights keep the potential small. And second we show that small potential implies small regret.

First, we argue that the weights ensure that the potential decreases (non-strictly) every round. Namely, by Golden-Thompson \begin{align*} \Phi_{T+1} &~=~ \mathop{\mathrm{\mathbb E}}_{\gamma(\eta)} \left[\mathop{\mathrm{tr}}\left(\boldsymbol{F}_{T+1}^\eta\right)\right] \\ &~\stackrel{\eqref{eq:gt}}{\le}~ \mathop{\mathrm{\mathbb E}}_{\gamma(\eta)} \left[\mathop{\mathrm{tr}}\left(\boldsymbol{F}_{T}^\eta (\boldsymbol{I}+ \eta \boldsymbol{R}_{T+1})\right)\right] \\ &~=~ \Phi_T + \underbrace{ \mathop{\mathrm{\mathbb E}}_{\gamma(\eta)} \left[\mathop{\mathrm{tr}}\left(\boldsymbol{F}_{T}^\eta \eta \boldsymbol{R}_{T+1}\right)\right] }_{=0} , \end{align*} where the last term is zeroed for any matrix \(\boldsymbol{L}_{T+1}\) by the choice of weights \(\eqref{eq:MiProd.w}\), since \begin{align*} \mathop{\mathrm{\mathbb E}}_{\gamma(\eta)} \left[ \mathop{\mathrm{tr}}\left( \boldsymbol{F}_T^\eta \eta \boldsymbol{R}_{T+1} \right) \right] &~=~ \mathop{\mathrm{\mathbb E}}_{\gamma(\eta)} \left[ \mathop{\mathrm{tr}}\left( \boldsymbol{F}_T^\eta \eta \left(\mathop{\mathrm{tr}}(\boldsymbol{W}_{T+1} \boldsymbol{L}_{T+1}) \boldsymbol{I}- \boldsymbol{L}_{T+1}\right) \right) \right] \\ &~=~ \mathop{\mathrm{tr}}\left( \left( \mathop{\mathrm{\mathbb E}}_{\gamma(\eta)} \left[ \mathop{\mathrm{tr}}\left( \boldsymbol{F}_T^\eta \right) \eta \right] \boldsymbol{W}_{T+1} - \mathop{\mathrm{\mathbb E}}_{\gamma(\eta)} \left[ \boldsymbol{F}_T^\eta \eta \right] \right) \boldsymbol{L}_{T+1} \right) \\ &~=~ \mathop{\mathrm{tr}}\left( \boldsymbol{0} \cdot \boldsymbol{L}_{T+1} \right) ~=~ 0 . \end{align*}

The preceding argument shows that \(\Phi_T\) is decreasing (non-strictly) with \(T\). We hence find \(\Phi_T \le \Phi_0 = d\). From this, we see that for any density matrix \(\boldsymbol{W}\) and learning rate \(\eta\) with prior mass \(\gamma(\eta\)), \begin{align*} \ln d &~\ge~ \ln \Phi_T \\ &~\ge~ \ln \gamma(\eta) + \ln \mathop{\mathrm{tr}}\left(\boldsymbol{F}_T^\eta\right) \\ & ~\stackrel{\eqref{eq:duality}}{\ge}~ \ln \gamma(\eta) - \mathop{\mathrm{tr}}\left(\boldsymbol{W}\ln \boldsymbol{W}\right) + \mathop{\mathrm{tr}}\left(\boldsymbol{W}\sum_{t=1}^T \ln \left(\boldsymbol{I}+ \eta \boldsymbol{R}_t\right)\right) % \\ & ~\stackrel{\eqref{eq:prod}}{\ge}~ \ln \gamma(\eta) - \mathop{\mathrm{tr}}\left(\boldsymbol{W}\ln \boldsymbol{W}\right) + \mathop{\mathrm{tr}}\left(\boldsymbol{W}\left(\eta \sum_{t=1}^T \boldsymbol{R}_t - \eta^2 \sum_{t=1}^T \boldsymbol{R}_t^2\right)\right) . \end{align*} We divide by \(\eta\) and reorganise this to \[\mathop{\mathrm{tr}}\left(\boldsymbol{W}\sum_{t=1}^T \boldsymbol{R}_t\right) ~\le~ \frac{ - \ln \gamma(\eta) + \triangle \left(\boldsymbol{W}\middle\|\boldsymbol{I}/d\right) }{ \eta } + \eta \mathop{\mathrm{tr}}\left(\boldsymbol{W}\sum_{t=1}^T \boldsymbol{R}_t^2\right) .\] At this point we may hook into the existing analysis from (Koolen and Van Erven 2015) (or, for that matter, this blog post) for choosing \(\gamma\) and tuning \(\eta\). Briefly, we may choose \(\gamma\) such that \(- \ln \gamma(\eta) = O(\ln \ln T)\). Then picking \[\eta ~=~ \sqrt{ \frac{ O(\ln \ln T) + \triangle \left(\boldsymbol{W}\middle\|\boldsymbol{I}/d\right) }{ \mathop{\mathrm{tr}}\left(\boldsymbol{W}\sum_{t=1}^T \boldsymbol{R}_t^2\right) } } ,\] and expanding the definition of the instantaneous regret matrix \(\boldsymbol{R}_t = \mathop{\mathrm{tr}}(\boldsymbol{W}_t \boldsymbol{L}_t) \boldsymbol{I}- \boldsymbol{L}_t\), we find that the cumulative regret compared to \(\boldsymbol{W}\) is bounded by \begin{align} & \notag \sum_{t=1}^T \mathop{\mathrm{tr}}(\boldsymbol{W}_t \boldsymbol{L}_t) - \sum_{t=1}^T \mathop{\mathrm{tr}}\left(\boldsymbol{W}\boldsymbol{L}_t\right) ~=~ \mathop{\mathrm{tr}}\left(\boldsymbol{W}\sum_{t=1}^T \boldsymbol{R}_t\right) \\ \label{eq:regret.bound} &~=~ O \left( \sqrt{ \big( \ln \ln T + \triangle \left(\boldsymbol{W}\middle\|\boldsymbol{I}/d\right) \big) \mathop{\mathrm{tr}}\left(\boldsymbol{W}\sum_{t=1}^T \boldsymbol{R}_t^2\right) } \right)\end{align} for any density matrix \(\boldsymbol{W}\). This is a very reasonable upgrade of the second-order quantile bound for classical iProd to matrices.

Let us now think about computing the Matrix iProd prediction \(\eqref{eq:MiProd.w}\). As argued by (Koolen and Van Erven 2015), it suffices to compute the integral over \(\eta\) on a grid of \(\ln T\) exponentially spaced values. If we maintain the matrix \[\sum_{t=1}^T \ln \left(\boldsymbol{I}+ \eta \boldsymbol{R}_t\right)\] for each \(\eta\) from the grid then we have to do \(O(d^3)\) work per round per learning rate to compute the matrix exponentials. Unfortunately, the eigensystem of the above matrix, which is the argument of the matrix exponential, is a function of \(\eta\), meaning that we cannot re-use the eigendecomposition between learning rates. So all in all this algorithm can be implemented in \(O(d^3 \ln T)\) time per round with \(O(d^2 \ln T)\) space. We now consider a different algorithm that might fare better in this regard.

In this section we look at the upgrade of Squint by (Koolen and Van Erven 2015) to the matrix setting. In the vector setting, Squint is a slight weakening of iProd with the computational advantage that the integral over \(\eta\) can be computed in closed form. Let’s see if this remains useful in the Matrix setting. Whereas for iProd the bound \(\eqref{eq:prod}\) appears in the analysis, for Squint it is incorporated in the algorithm.

Matrix Squint, like iProd, makes use of a prior distribution \(\gamma(\eta)\) on learning rates \(\eta \in [0,1/2]\). The *Matrix Squint* potential function and weights are given by \(\eqref{eq:MiProd.w}\), with the new choice \begin{equation}\label{eq:MSquint}
\boldsymbol{F}_T^\eta
~=~
\exp \left(\eta \sum_{t=1}^T \boldsymbol{R}_t - \eta^2 \sum_{t=1}^T \boldsymbol{R}_t^2\right)
.
\end{equation}

The argument that Matrix Squint keeps the regret small is very similar to that for Matrix iProd. First, we argue that the weights ensure that the potential decreases (non-strictly) every round. Namely \begin{align*} \Phi_{T+1} &~=~ \mathop{\mathrm{\mathbb E}}_{\gamma(\eta)} \left[ \mathop{\mathrm{tr}}\left( \exp \left( \boldsymbol{F}_{T+1}^\eta\right)\right) \right] \\ &~\stackrel{\eqref{eq:gt}}{\le}~ \mathop{\mathrm{\mathbb E}}_{\gamma(\eta)} \left[ \mathop{\mathrm{tr}}\left( \boldsymbol{F}_T^\eta \exp \left(\eta \boldsymbol{R}_{T+1} - \eta^2 \boldsymbol{R}_{T+1}^2\right)\right) \right] \\ &~\stackrel{\eqref{eq:prod}}{\le}~ \mathop{\mathrm{\mathbb E}}_{\gamma(\eta)} \left[ \mathop{\mathrm{tr}}\left( \boldsymbol{F}_T^\eta \left(\boldsymbol{I}+ \eta \boldsymbol{R}_{T+1}\right) \right) \right] \\ &~=~ \Phi_T + \underbrace{ \mathop{\mathrm{\mathbb E}}_{\gamma(\eta)} \left[ \mathop{\mathrm{tr}}\left( \boldsymbol{F}_T^\eta \eta \boldsymbol{R}_{T+1} \right) \right] }_{\text{$=0$ by $\eqref{eq:MiProd.w}$}} . \end{align*} The rest of the analysis follows that of Matrix iProd. In particular, the regret bound \(\eqref{eq:regret.bound}\) also holds for Matrix Squint.

Let us now think about computing the Matrix Squint prediction \(\eqref{eq:MSquint}\). Following (Koolen and Van Erven 2015) we can compute the integral over \(\eta\) on a grid of \(\ln T\) values. If we maintain the matrices \(\sum_{t=1}^T \boldsymbol{R}_t\) and \(\sum_{t=1}^T \boldsymbol{R}_t^2\) then we have to do \(O(d^3)\) work per round per learning rate to compute the matrix exponentials. Unfortunately, the eigensystem of \[\eta \sum_{t=1}^T \boldsymbol{R}_t - \eta^2 \sum_{t=1}^T \boldsymbol{R}_t^2,\] which is the argument of the exponential, is a function of \(\eta\), meaning that we cannot re-use the eigendecomposition between learning rates. In particular, it seems we cannot apply the closed-form expression of the classical vector Squint potential spectrally.

iProd is sharper than Squint because it delays the prod bound \(\eqref{eq:prod}\) to the analysis. It is very tempting to also try and delay Golden-Thompson \(\eqref{eq:gt}\) to the analysis. Two intuitively reasonable candidates are \[\boldsymbol{F}_T^\eta ~=~ \prod_{t=1}^T (\boldsymbol{I}+ \eta \boldsymbol{R}_T) ,\] and \[\boldsymbol{F}_T^\eta ~=~ \sqrt{\boldsymbol{I}+ \eta \boldsymbol{R}_T} \cdots \sqrt{\boldsymbol{I}+ \eta \boldsymbol{R}_1} \sqrt{\boldsymbol{I}+ \eta \boldsymbol{R}_1} \cdots \sqrt{\boldsymbol{I}+ \eta \boldsymbol{R}_T} .\] The first is the most direct analogue of the vector iProd potential. Unfortunately the resulting weights are not symmetric. The second expression solves this, and the associated weights keep the potential constant. But now the problem reappears in another form, namely that small potential is not necessarily useful.

I have not succeeded in finding a Golden-Thompson free potential. Let me know if you do.

Hazan, Elad, Satyen Kale, and Shai Shalev-Shwartz. 2012. “Near-Optimal Algorithms for Online Matrix Prediction.” In *COLT 2012 - the 25th Annual Conference on Learning Theory, June 25-27, 2012, Edinburgh, Scotland*, edited by Shie Mannor, Nathan Srebro, and Robert C. Williamson, 23:38.1–13. JMLR Proceedings. JMLR.org. http://www.jmlr.org/proceedings/papers/v23/hazan12b/hazan12b.pdf.

Koolen, Wouter M., and Tim van Erven. 2015. “Second-Order Quantile Methods for Experts and Combinatorial Games.” In *Proceedings of the 28th Annual Conference on Learning Theory (COLT)*, edited by Peter Grn̈wald, Elad Hazan, and Satyen Kale, 1155–75. http://jmlr.org/proceedings/papers/v40/Koolen15a.pdf.

Koolen, Wouter M., Wojciech Kotłowski, and Manfred K. Warmuth. 2011. “Learning Eigenvectors for Free.” In *Advances in Neural Information Processing Systems (NIPS) 24*, edited by John Shawe-Taylor, Richard S. Zemel, Peter Bartlett, Fernando C. N. Pereira, and Kilian Q. Weinberger, 945–53. http://papers.nips.cc/paper/4377-learning-eigenvectors-for-free.pdf.

Tsuda, Koji, Gunnar Rätsch, and Manfred K. Warmuth. 2005. “Matrix Exponentiated Gradient Updates for on-Line Learning and Bregman Projection.” *Journal of Machine Learning Research* 6: 995–1018. http://www.jmlr.org/papers/v6/tsuda05a.html.

Warmuth, Manfred K., and Dima Kuzmin. 2006. “Online Variance Minimization.” In *Proceedings of the 19th Annual Conference on Learning Theory (COLT)*. Pittsburg: Springer.

———. 2008. “Randomized Online PCA Algorithms with Regret Bounds That Are Logarithmic in the Dimension.” *Journal of Machine Learning Research*, October, 2287–2320.

———. 2010. “Bayesian Generalized Probability Calculus for Density Matrices.” *Machine Learning* 78 (1-2): 63–101. https://doi.org/10.1007/s10994-009-5133-7.