2020-10-13

The following question arose during a discussion with Tim van Erven about coordinate-wise algorithms for online optimisation. I wrote the post to be intelligible without that context.

Our starting point is a given batch of data \({\boldsymbol x}_1, \ldots, {\boldsymbol x}_T \in \mathbb R^d\). We consider the following sparsity measure, \[\sum_{i=1}^d \sqrt{\sum_{t=1}^T x_{t,i}^2},\] where lower values signify sparser data. This measure occurs, for example, in the regret bound of the diagonal AdaGrad algorithm. Our sparsity measure is bounded below by the Frobenius norm \(\sqrt{\sum_{t=1}^T \sum_{i=1}^d x_{t,i}^2}\). Moreover, it is equal to the Frobenius norm iff all nonzero \(x_{t,i}\) share a single common dimension \(i\), which indeed is an intuitively sparse setup. In this post we investigate how we can pre-process our data to make it more sparse.

Our approach will be to choose an orthogonal matrix \({\boldsymbol R}\), and look at the transformed data set \({\boldsymbol y}_t = {\boldsymbol R}{\boldsymbol x}_t\) for \(t=1,\ldots,T\). The question is how to pick \({\boldsymbol R}\) to minimise the sparsity measure above.

More precisely, we are looking at the optimisation problem \[V ~=~ \min_{\boldsymbol R}~ \sum_{i=1}^d \sqrt{\sum_{t=1}^T y_{t,i}^2} \qquad \text{where} \qquad {\boldsymbol y}_t = {\boldsymbol R}{\boldsymbol x}_t\] where \({\boldsymbol R}\) ranges over all orthogonal matrices. We have \(y_{t,i}^2 = {\boldsymbol e}_i^\intercal{\boldsymbol R}{\boldsymbol x}_t {\boldsymbol x}_t^\intercal{\boldsymbol R}^\intercal{\boldsymbol e}_i\), and hence \[\sum_{t=1}^T y_{t,i}^2 ~=~ {\boldsymbol e}_i^\intercal{\boldsymbol R}{\boldsymbol G}{\boldsymbol R}^\intercal{\boldsymbol e}_i \qquad \text{where} \qquad {\boldsymbol G}~=~ \sum_{t=1}^T {\boldsymbol x}_t {\boldsymbol x}_t^\intercal .\] With this rewrite, we may write the objective as \[\sum_{i=1}^d \sqrt{{\boldsymbol e}_i^\intercal{\boldsymbol R}{\boldsymbol G}{\boldsymbol R}^\intercal{\boldsymbol e}_i},\] which we see is a function of the diagonal entries of \({\boldsymbol R}{\boldsymbol G}{\boldsymbol R}^\intercal\). The result of this post is that the optimal value is the trace norm (aka nuclear norm) of \({\boldsymbol G}\). More precisely,

**Proposition 1.** *Let the PSD matrix \({\boldsymbol G}\) have eigendecomposition \({\boldsymbol G}= \sum_{i=1}^d \lambda_i {\boldsymbol u}_i {\boldsymbol u}_i^\intercal\). Then \[V ~=~
\min_{\boldsymbol R}~
\sum_{i=1}^d \sqrt{{\boldsymbol e}_i^\intercal{\boldsymbol R}{\boldsymbol G}{\boldsymbol R}^\intercal{\boldsymbol e}_i}
~=~
\sum_{i=1}^d \sqrt{\lambda_i}\] and a best rotation is \({\boldsymbol R}= [{\boldsymbol u}_1, \ldots, {\boldsymbol u}_d]^\intercal\).*

*Proof* Consider the set \(\mathcal D \subseteq \mathbb R^d\) of diagonals (seen as vectors in \(\mathbb R^d\)) of PSD matrices with eigenvalues \({\boldsymbol \lambda}= (\lambda_1,\dots,\lambda_d)\). Note that \(\mathcal D\) contains the diagonal of \({\boldsymbol R}{\boldsymbol G}{\boldsymbol R}^\intercal\). Now by Schur-convexity, \(\mathcal D\) is the convex hull of all permutations of \({\boldsymbol \lambda}\). Hence the concave function \({\boldsymbol z}\mapsto \sum_i \sqrt{z_i}\) is minimised over \(\mathcal D\) at a corner. In fact, the value is the same in each corner (the corners are the permutations of \({\boldsymbol \lambda}\)), where it is equal to \(\sum_{i=1}^d \sqrt{\lambda_i}\). We also see that a minimiser \({\boldsymbol R}\) is obtained by diagonalising \({\boldsymbol G}\). \(\Box\)

We find that the optimal pre-processing step is to rotate the data to diagonalise the Grammian matrix \({\boldsymbol G}\). This is an intuitive thing to do, and this post puts that on a formal footing, relating it to optimising sparsity. It will be interesting to see if coordinate-wise algorithms for online optimisation actually improve with such pre-processing. These algorithms generate the data by querying gradients at iterates, and the pre-processing step will affect the iterate sequence and hence the data sequence. Still, the intuition may hold. A second interesting direction is to think about other pre-processing steps, for example one could consider all invertible linear transformations (say of unit determinant, to preserve the scale). Finally, in online optimisation sparsity is often traded off against the norm of the comparator. It will be cool to study the effect of pre-processing on their interaction.