We look at bounding a symmetric rank one matrix from above by a diagonal matrix.

Introduction

The topic of this post came up during a discussion with Tim van Erven. Sometimes it is important to bound a matrix from above by a diagonal matrix. For example, if the original matrix describes the covariance structure between features, then a diagonal matrix corresponds to independence. In this post we obtain a diagonal upper bound for a symmetric rank one matrix \({\boldsymbol{{\ell}}}{\boldsymbol{{\ell}}}^{\intercal}\) (a matrix of this form is sometimes called a dyad). So we are looking for a diagonal matrix \({\boldsymbol{D}}\) such that \[{\boldsymbol{D}}~\succeq~ {\boldsymbol{{\ell}}}{\boldsymbol{{\ell}}}^{\intercal}\] where \(\succeq\) denotes the Löwner order (\({\boldsymbol{A}}\succeq {\boldsymbol{B}}\) if \({\boldsymbol{A}}- {\boldsymbol{B}}\) is positive semi-definite).

Now there are many matrices \({\boldsymbol{D}}\) satisfying the above requirement. We want to find the best one. To define optimality we need to put forward a scalar-valued objective. Here we consider minimising the trace of \({\boldsymbol{D}}\). So we are looking for \[V ~:=~ \min_{\substack{ {\boldsymbol{D}}\succeq {\boldsymbol{{\ell}}}{\boldsymbol{{\ell}}}^{\intercal}, \\ \text{${\boldsymbol{D}}$ diagonal} }}~ \operatorname{tr}({\boldsymbol{D}}) .\] As a first step, let us reparametrise by \({\boldsymbol{D}}= \operatorname{diag}({\boldsymbol{d}})\). Then we are looking for \[V ~=~ \min_{ \operatorname{diag}({\boldsymbol{d}}) \succeq {\boldsymbol{{\ell}}}{\boldsymbol{{\ell}}}^{\intercal}}~ \sum_i d_i .\]

Now how to deal with the matrix inequality? For \({\boldsymbol{D}}\succ {\boldsymbol{0}}\) we have \({\boldsymbol{D}}\succeq {\boldsymbol{{\ell}}}{\boldsymbol{{\ell}}}^{\intercal}\) iff \({\boldsymbol{{\ell}}}^{\intercal}{\boldsymbol{D}}^{-1} {\boldsymbol{{\ell}}}\le 1\). This follows by considering the two Schur complements of the matrix \(\begin{bmatrix} {\boldsymbol{D}}& {\boldsymbol{{\ell}}}\\ {\boldsymbol{{\ell}}}^{\intercal}& 1 \end{bmatrix}\): this block matrix is positive semi-definite iff \({\boldsymbol{D}}- {\boldsymbol{{\ell}}}{\boldsymbol{{\ell}}}^{\intercal}\succeq {\boldsymbol{0}}\), and also iff \(1 - {\boldsymbol{{\ell}}}^{\intercal}{\boldsymbol{D}}^{-1} {\boldsymbol{{\ell}}}\ge 0\). Furthermore, inverting a diagonal matrix simply inverts its diagonal elements. So we need to solve \[V ~=~ \min_{ {\boldsymbol{d}}\ge 0 : \sum_i l_i^2/d_i \le 1 }~ \sum_i d_i .\] Introducing a Lagrange multiplier \(\mu\) for the constraint, we find \[V ~=~ \max_{\mu \ge 0} \min_{ {\boldsymbol{d}}\ge 0 }~ \sum_i d_i + \mu {\left( \sum_i l_i^2/d_i - 1 \right)} .\] We solve the innermost optimisation by setting the \(d_i\)-derivative to zero. We find \[0 ~=~ 1 - \mu l_i^2/d_i^2 \qquad \text{and hence} \qquad d_i ~=~ \sqrt{\mu} |l_i| .\] Substituting this back in, it remains to solve \[V ~=~ \max_{\mu \ge 0}~ \sum_i \sqrt{\mu} |l_i| + \mu {\left( \sum_i \frac{l_i^2}{\sqrt{\mu} |l_i|} - 1 \right)} ~=~ \max_{\mu \ge 0}~ 2 \sqrt{\mu} \sum_i |l_i| - \mu .\] Optimising \(\mu\) by equating its derivative to zero results in \[0 ~=~ \frac{1}{\sqrt{\mu}} \sum_i |l_i| - 1 \qquad \text{so that} \qquad \mu ~=~ {\Big( \sum_i |l_i| \Big)}^2 .\] We hence find the optimiser and optimal value \[d_i ~=~ |l_i| \sum_j |l_j| \qquad \text{and} \qquad V ~=~ {\Big(\sum_i |l_i|\Big)}^2 .\]

Okay, that is completely explicit. So how far does this method stretch?
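
For readers who like to double-check such identities numerically, here is a minimal NumPy sketch (the vector \({\boldsymbol{{\ell}}}\) and its dimension are arbitrary test choices, not part of the derivation): it builds the optimiser \(d_i = |l_i| \sum_j |l_j|\), verifies feasibility both via the smallest eigenvalue of \({\boldsymbol{D}}- {\boldsymbol{{\ell}}}{\boldsymbol{{\ell}}}^{\intercal}\) and via the Schur-complement condition, and compares the trace with \((\sum_i |l_i|)^2\).

```python
import numpy as np

# Minimal sanity check of the trace bound; the vector l below is an
# arbitrary test choice, not part of the derivation.
rng = np.random.default_rng(0)
l = rng.standard_normal(5)

# Claimed optimiser: d_i = |l_i| * sum_j |l_j|.
d = np.abs(l) * np.abs(l).sum()
D = np.diag(d)

# Feasibility check 1: D - l l^T should be positive semi-definite
# (smallest eigenvalue ~0 up to rounding, since the bound is tight).
print("min eigenvalue of D - l l^T:", np.linalg.eigvalsh(D - np.outer(l, l)).min())

# Feasibility check 2 (Schur complement): l^T D^{-1} l <= 1 (here it equals 1).
print("l^T D^{-1} l:", np.sum(l ** 2 / d))

# The objective value matches the closed form (sum_i |l_i|)^2.
print("trace(D):        ", np.trace(D))
print("(sum_i |l_i|)^2: ", np.abs(l).sum() ** 2)
```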

Schatten \(p\)-norm Objectives

The previous result is specific to the scalar objective chosen. Yet the method generalises from the trace to any Schatten \(p\)-norm as follows. Define \[V ~:=~ \min_{\substack{ {\boldsymbol{D}}\succeq {\boldsymbol{{\ell}}}{\boldsymbol{{\ell}}}^{\intercal}, \\ \text{${\boldsymbol{D}}$ diagonal} }}~ {\|{\boldsymbol{D}}\|}_p\] where \({\|{\boldsymbol{D}}\|}_p\) denotes the Schatten \(p\)-norm, that is, the vector \(p\)-norm \({\|{\boldsymbol{{\ell}}}\|}_p = (\sum_i |l_i|^p)^{1/p}\) applied to the eigenvalues of the matrix \({\boldsymbol{D}}\). We raise the objective to the power \(p\) for convenience and rewrite as before \[V^p ~=~ \min_{ {\boldsymbol{d}}\ge 0 : \sum_i l_i^2/d_i \le 1 }~ \sum_i d_i^p .\] Introducing a Lagrange multiplier \(\mu\) for the constraint, we find \[V^p ~=~ \max_{\mu \ge 0} \min_{ {\boldsymbol{d}}\ge 0 }~ \sum_i d_i^p + \mu {\left(\sum_i l_i^2/d_i - 1\right)} .\] Setting the \(d_i\)-derivative to zero results in \[d_i ~=~ {\left( \frac{ \mu l_i^2 }{ p } \right)}^{\frac{1}{p+1}}\] so that \[V^p ~=~ \max_{\mu \ge 0}~ (p^{-\frac{p}{p+1}} + p^{\frac{1}{p+1}}) \mu^{\frac{p}{p+1}} \sum_i |l_i|^{\frac{2 p}{p+1}} - \mu .\] Now optimising for \(\mu\) results in \[\mu ~=~ p {\bigg(\sum_i |l_i|^{\frac{2 p}{p+1}}\bigg)}^{p+1}\] so that we have the optimiser and value \[d_i ~=~ |l_i|^{\frac{2}{p+1}} \sum_j |l_j|^{\frac{2 p}{p+1}} \qquad \text{and} \qquad V ~=~ {\bigg( \sum_j |l_j|^{\frac{2 p}{p+1}} \bigg)}^{\frac{p+1}{p}} .\]
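
Before moving to the examples, here is a short NumPy sketch of the general formula (the helper name `schatten_diagonal_bound` and the test values of \({\boldsymbol{{\ell}}}\) and \(p\) are illustrative choices, not from the derivation): for each \(p\) it checks that \(\operatorname{diag}({\boldsymbol{d}}) \succeq {\boldsymbol{{\ell}}}{\boldsymbol{{\ell}}}^{\intercal}\) and that the Schatten \(p\)-norm of the resulting diagonal matrix equals the claimed value \(V\).

```python
import numpy as np

def schatten_diagonal_bound(l, p):
    """Optimal diagonal d and optimal Schatten-p norm V from the formulas above.

    (Illustrative helper name; used here only for p >= 1, matching the derivation.)
    """
    s = np.sum(np.abs(l) ** (2 * p / (p + 1)))
    d = np.abs(l) ** (2 / (p + 1)) * s
    V = s ** ((p + 1) / p)
    return d, V

rng = np.random.default_rng(1)
l = rng.standard_normal(4)

for p in (1.0, 1.5, 2.0, 3.0):
    d, V = schatten_diagonal_bound(l, p)
    # Feasibility: diag(d) - l l^T must be PSD (eigenvalues >= 0 up to rounding).
    min_eig = np.linalg.eigvalsh(np.diag(d) - np.outer(l, l)).min()
    # For a PSD diagonal matrix the Schatten-p norm is the vector p-norm of the diagonal.
    schatten_p = np.sum(d ** p) ** (1 / p)
    print(f"p={p}:  min eig = {min_eig:+.1e},  ||D||_p = {schatten_p:.6f},  V = {V:.6f}")
```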

Examples

The case \(p=1\) is called the trace norm or nuclear norm, where we find \[d_i ~=~ |l_i| \sum_j |l_j| \qquad \text{and} \qquad V ~=~ {\Big(\sum_j |l_j|\Big)}^{2} .\] The case \(p=2\) is called the Frobenius norm, where we find \[d_i ~=~ |l_i|^{\frac{2}{3}} \sum_j |l_j|^{\frac{4}{3}} \qquad \text{and} \qquad V ~=~ {\bigg(\sum_j |l_j|^{\frac{4}{3}}\bigg)}^{3/2} .\] The case \(p=\infty\) (maximum eigenvalue) is called the spectral norm, where we find \[d_i ~=~ \sum_j l_j^2 \qquad \text{and} \qquad V ~=~ \sum_j l_j^2 .\] Note that in this last case the \(d_i\) are all the same, so the optimal matrix is a multiple of the identity matrix.
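
The spectral-norm case is easy to check by hand, but a tiny sketch makes the tightness visible (the vector below is an arbitrary example): the gap \({\boldsymbol{D}}- {\boldsymbol{{\ell}}}{\boldsymbol{{\ell}}}^{\intercal}\) has one zero eigenvalue in the direction of \({\boldsymbol{{\ell}}}\), and the spectral norms of \({\boldsymbol{D}}\) and \({\boldsymbol{{\ell}}}{\boldsymbol{{\ell}}}^{\intercal}\) coincide at \(\sum_j l_j^2\).

```python
import numpy as np

# The spectral-norm (p = infinity) case: D = (sum_j l_j^2) * I.
# The vector l is an arbitrary example.
l = np.array([0.5, -1.0, 2.0])
D = np.sum(l ** 2) * np.eye(len(l))

# Eigenvalues of the gap: one zero (the bound is tight along l),
# the remaining ones equal sum_j l_j^2.
print("eigenvalues of D - l l^T:", np.linalg.eigvalsh(D - np.outer(l, l)))

# Both spectral norms equal sum_j l_j^2.
print("||D||_inf      =", np.linalg.norm(D, 2))
print("||l l^T||_inf  =", np.linalg.norm(np.outer(l, l), 2))
```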

A case of interest is also \(p=0\), which counts the nonzero eigenvalues of \({\boldsymbol{D}}\), i.e. its rank. The above analysis probably does not work (as \(p\)-norms switch from concave to convex at \(p=1\)), but we can still argue directly. If \(d_i = 0\) then the \(i\)-th diagonal entry of \({\boldsymbol{D}}- {\boldsymbol{{\ell}}}{\boldsymbol{{\ell}}}^{\intercal}\) equals \(-l_i^2\), which can only be non-negative if \(l_i = 0\). So assuming all \(l_i\) are nonzero, every feasible \({\boldsymbol{D}}\) has full rank \(n\), and this value is attained, for example by the feasible choice suggested by the final formula. We thus find \[d_i ~=~ n l_i^2 \qquad \text{and} \qquad V ~=~ n .\]
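
As a last sanity check, a few lines of NumPy confirm the \(p=0\) claim under the assumption that all \(l_i\) are nonzero (the test vector is again an arbitrary choice): the diagonal \(d_i = n l_i^2\) is feasible, and it has exactly \(n\) nonzero entries.

```python
import numpy as np

# The p = 0 case (counting nonzero eigenvalues); assumes all l_i are nonzero.
l = np.array([0.3, -1.2, 0.7, 2.0])   # arbitrary test vector without zeros
n = len(l)

d = n * l ** 2                        # the claimed optimiser d_i = n l_i^2
# Feasible: sum_i l_i^2 / d_i = 1, so diag(d) - l l^T is PSD
# (smallest eigenvalue ~0 up to rounding).
print("min eigenvalue:", np.linalg.eigvalsh(np.diag(d) - np.outer(l, l)).min())
# The "p = 0 norm" of diag(d) counts its nonzero entries, which is n.
print("nonzero entries of d:", np.count_nonzero(d), " (n =", n, ")")
```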