2015-11-19
We look at bounding a symmetric rank one matrix from above by a diagonal matrix.
The topic of this post came up during a discussion with Tim van Erven. Sometimes it is important to bound a matrix form above by a diagonal matrix. For example, if the original matrix describes covariance structure between features then a diagonal matrix corresponds to independence. In this post we obtain a diagonal upper bound for a symmetric rank one matrix \(\boldsymbol{\ell}\boldsymbol{\ell}^\intercal\) (a matrix of this form is sometimes called a dyad). So we are looking for a diagonal matrix \(\boldsymbol{D}\) such that \[\boldsymbol{D} ~\succeq~ \boldsymbol{\ell}\boldsymbol{\ell}^\intercal\] where \(\succeq\) denotes the Löwner order (\(\boldsymbol{A}\succeq \boldsymbol{B}\) if \(\boldsymbol{A}- \boldsymbol{B}\) is positive semi-definite). Now there are many matrices \(\boldsymbol{D}\) satisfying the above requirement. We want to find the best one. To define optimality we need to put forward a scalar-valued objective. Here we consider minimising the trace of \(\boldsymbol{D}\). So we are looking for \[V ~:=~ \min_{\substack{ \boldsymbol{D}\succeq \boldsymbol{\ell}\boldsymbol{\ell}^\intercal, \\ \text{$\boldsymbol{D}$ diagonal} }}~ \mathop{\mathrm{tr}}(\boldsymbol{D}) .\] As a first step, let us reparametrise by \(\boldsymbol{D}= \mathop{\mathrm{diag}}(\boldsymbol{d})\). Then we are looking for \[V ~=~ \min_{ \mathop{\mathrm{diag}}(\boldsymbol{d}) \succeq \boldsymbol{\ell}\boldsymbol{\ell}^\intercal }~ \sum_i d_i .\] Now how to deal with the matrix inequality? For \(\boldsymbol{D}\succeq \boldsymbol{0}\) we have \(\boldsymbol{D}\succeq \boldsymbol{\ell}\boldsymbol{\ell}^\intercal\) iff \(\boldsymbol{\ell}^\intercal\boldsymbol{D}^{-1} \boldsymbol{\ell}\le 1\). This follows from reasoning about the Schur complements of the matrix \(\begin{bmatrix} \boldsymbol{D}& \boldsymbol{\ell}\\ \boldsymbol{\ell}^\intercal& 1 \end{bmatrix}\). Furthermore, inverting a diagonal matrix simply inverts its diagonal elements. So we need to solve \[V ~=~ \min_{ \boldsymbol{d}\ge 0 : \sum_i l_i^2/d_i \le 1 %\l^\top \diag(\d^{-1}) \l \le 1 }~ \sum_i d_i .\] Introducing a Lagrange multiplier \(\mu\) for the constraint, we find \[V ~=~ \max_{\mu \ge 0} \min_{ \boldsymbol{d}\ge 0 %\l^\top \diag(\d^{-1}) \l \le 1 }~ \sum_i d_i + \mu \left( \sum_i l_i^2/d_i - 1 \right) .\] We solve the innermost optimisation by setting the \(d_i\)-derivative to zero. We find \[0 ~=~ 1 - \mu l_i^2/d_i^2 \qquad \text{and hence} \qquad d_i ~=~ \sqrt{\mu} |l_i|\] Substituting this back in, it hence remains to solve \[V ~=~ \max_{\mu \ge 0}~ \sum_i \sqrt{\mu} |l_i| + \mu \left( \sum_i \frac{l_i^2}{\sqrt{\mu} |l_i|} - 1 \right) ~=~ \max_{\mu \ge 0}~ 2 \sqrt{\mu} \sum_i |l_i| - \mu .\] Optimising \(\mu\) by equating its derivative to zero results in \[0 ~=~ \frac{1}{\sqrt{\mu}} \sum_i |l_i| - 1 \qquad \text{so that} \qquad \mu ~=~ \Big( \sum_i |l_i| \Big)^2 .\] We hence find optimiser and optimal value \[d_i ~=~ |l_i| \sum_j |l_j| \qquad \text{and} \qquad V ~=~ \Big(\sum_i |l_i|\Big)^2 .\] Okay, that is completely explicit. So how far does this method stretch?
The previous result is specific to the scalar objective chosen. Yet the method generalises from the trace to any Schatten \(p\)-norm as follows. Define \[V ~:=~ \min_{\substack{ \boldsymbol{D}\succeq \boldsymbol{\ell}\boldsymbol{\ell}^\intercal, \\ \text{$\boldsymbol{D}$ diagonal} }}~ \|\boldsymbol{D}\|_p\] where \(\|\boldsymbol{D}\|_p\) denotes the Schatten \(p\)-norm, that is, the vector \(p\)-norm \(\|\boldsymbol{\ell}\|_p = (\sum_i |l_i|^p)^{1/p}\) applied to the eigenvalues of the matrix \(\boldsymbol{D}\). We raise the objective to the power \(p\) for convenience and rewrite as before \[V^p ~=~ \min_{ \boldsymbol{d}\ge 0 : \sum_i l_i^2/d_i \le 1 }~ \sum_i d_i^p .\] Introducing Lagrange multiplier \(\mu\) for the constraint, we find \[V^p ~=~ \max_{\mu \ge 0} \min_{ \boldsymbol{d}\ge 0 }~ \sum_i d_i^p + \mu \left(\sum_i l_i^2/d_i - 1\right) .\] Setting the derivative to \(0\) results in \[d_i ~=~ \left( \frac{ \mu l_i^2 }{ p } \right)^{\frac{1}{p+1}}\] so that \[V^p ~=~ \max_{\mu \ge 0}~ (p^{-\frac{p}{p+1}} + p^{\frac{1}{p+1}}) \mu^{\frac{p}{p+1}} \sum_i l_i^{\frac{2 p}{p+1}} - \mu .\] Now optimising for \(\mu\) results in \[\mu ~=~ p \bigg(\sum_i l_i^{\frac{2 p}{p+1}}\bigg)^{p+1}\] so that we have optimiser and value \[d_i ~=~ l_i^{\frac{2}{p+1}} \sum_j l_j^{\frac{2 p}{p+1}} \qquad \text{and} \qquad V ~=~ \bigg( \sum_j l_j^{\frac{2 p}{p+1}} \bigg)^{\frac{p+1}{p}} .\]
The case \(p=1\) is called the trace norm or nuclear norm, where we find \[d_i ~=~ |l_i| \sum_j |l_j| \qquad \text{and} \qquad V ~=~ \Big(\sum_j |l_j|\Big)^{2} .\] The case \(p=2\) is called the Frobenius norm where we find \[d_i ~=~ l_i^{\frac{2}{3}} \sum_j l_j^{\frac{4}{3}} \qquad \text{and} \qquad V ~=~ \bigg(\sum_j l_j^{\frac{4}{3}}\bigg)^{3/2} .\] The case \(p=\infty\) (maximum eigenvalue) is called the spectral norm where we find \[d_i ~=~ \sum_j l_j^2 \qquad \text{and} \qquad V ~=~ \sum_j l_j^2 .\] Note that in this last case the \(d_i\) are all the same, so the optimal matrix is a multiple of the identity matrix.
A case of interest is also \(p=0\). The above analysis probably does not work (as \(p\)-norms switch from concave to convex at \(p=1\)), but we can still show in a different way that the final formula gives the correct answer, i.e. we find \[d_i ~=~ n l_i^2 \qquad \text{and} \qquad V ~=~ n .\]