We look at bounding a symmetric rank one matrix from above by a diagonal matrix.

# Introduction

The topic of this post came up during a discussion with Tim van Erven. Sometimes it is important to bound a matrix from above by a diagonal matrix. For example, if the original matrix describes the covariance structure between features, then a diagonal matrix corresponds to independence. In this post we obtain a diagonal upper bound for a symmetric rank one matrix $${\boldsymbol{{\ell}}}{\boldsymbol{{\ell}}}^{\intercal}$$ (a matrix of this form is sometimes called a dyad). So we are looking for a diagonal matrix $${\boldsymbol{D}}$$ such that

${\boldsymbol{D}}~\succeq~ {\boldsymbol{{\ell}}}{\boldsymbol{{\ell}}}^{\intercal}$

where $$\succeq$$ denotes the Löwner order ($${\boldsymbol{A}}\succeq {\boldsymbol{B}}$$ if $${\boldsymbol{A}}- {\boldsymbol{B}}$$ is positive semi-definite). There are many matrices $${\boldsymbol{D}}$$ satisfying this requirement, and we want to find the best one. To define optimality we need to put forward a scalar-valued objective. Here we consider minimising the trace of $${\boldsymbol{D}}$$. So we are looking for

$V ~:=~ \min_{\substack{ {\boldsymbol{D}}\succeq {\boldsymbol{{\ell}}}{\boldsymbol{{\ell}}}^{\intercal}, \\ {\boldsymbol{D}}\text{ diagonal} }}~ \operatorname{tr}({\boldsymbol{D}}) .$

As a first step, let us reparametrise by $${\boldsymbol{D}}= \operatorname{diag}({\boldsymbol{d}})$$. Then we are looking for

$V ~=~ \min_{ \operatorname{diag}({\boldsymbol{d}}) \succeq {\boldsymbol{{\ell}}}{\boldsymbol{{\ell}}}^{\intercal}}~ \sum_i d_i .$

Now how to deal with the matrix inequality? For $${\boldsymbol{D}}\succ {\boldsymbol{0}}$$ we have $${\boldsymbol{D}}\succeq {\boldsymbol{{\ell}}}{\boldsymbol{{\ell}}}^{\intercal}$$ iff $${\boldsymbol{{\ell}}}^{\intercal}{\boldsymbol{D}}^{-1} {\boldsymbol{{\ell}}}\le 1$$. This follows from reasoning about the Schur complements of the matrix $$\begin{bmatrix} {\boldsymbol{D}}& {\boldsymbol{{\ell}}}\\ {\boldsymbol{{\ell}}}^{\intercal}& 1 \end{bmatrix}$$.
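This equivalence is easy to sanity-check numerically. The sketch below (plain Python; the helper `is_psd`, a hand-rolled Cholesky-based test with a small tolerance, is an illustration of mine and not part of the derivation) compares the matrix condition $${\boldsymbol{D}}\succeq {\boldsymbol{{\ell}}}{\boldsymbol{{\ell}}}^{\intercal}$$ against the scalar condition $$\sum_i l_i^2/d_i \le 1$$ on random instances scaled to sit strictly inside or outside the feasible region:

```python
import random

def is_psd(A, tol=1e-9):
    """Positive semi-definiteness test: Cholesky succeeds on A + tol*I."""
    n = len(A)
    M = [[A[i][j] + (tol if i == j else 0.0) for j in range(n)] for i in range(n)]
    for k in range(n):
        if M[k][k] < 0:
            return False
        pivot = M[k][k] ** 0.5
        if pivot == 0.0:
            # Zero pivot: the rest of the column must vanish too.
            if any(abs(M[i][k]) > tol for i in range(k + 1, n)):
                return False
            continue
        for i in range(k + 1, n):
            M[i][k] /= pivot
        for j in range(k + 1, n):
            for i in range(j, n):
                M[i][j] -= M[i][k] * M[j][k]
    return True

def dominates(d, l):
    """Is diag(d) - l lᵀ positive semi-definite?"""
    n = len(l)
    A = [[(d[i] if i == j else 0.0) - l[i] * l[j] for j in range(n)] for i in range(n)]
    return is_psd(A)

random.seed(0)
for _ in range(50):
    l = [random.uniform(-1, 1) for _ in range(4)]
    d0 = [random.uniform(0.1, 2) for _ in range(4)]
    s = sum(li * li / di for li, di in zip(l, d0))
    # Rescale d0 to land strictly inside / outside the constraint set:
    # sum_i l_i^2 / (c * d0_i) = s / c, so c = 1.1*s is feasible, c = 0.9*s is not.
    assert dominates([di * s * 1.1 for di in d0], l) is True
    assert dominates([di * s * 0.9 for di in d0], l) is False
```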
Furthermore, inverting a diagonal matrix simply inverts its diagonal elements. So we need to solve

$V ~=~ \min_{ {\boldsymbol{d}}\ge 0 : \sum_i l_i^2/d_i \le 1 }~ \sum_i d_i .$

Introducing a Lagrange multiplier $$\mu$$ for the constraint, we find

$V ~=~ \max_{\mu \ge 0} \min_{ {\boldsymbol{d}}\ge 0 }~ \sum_i d_i + \mu {\left( \sum_i l_i^2/d_i - 1 \right)} .$

We solve the innermost optimisation by setting the $$d_i$$-derivative to zero. We find

$0 ~=~ 1 - \mu l_i^2/d_i^2 \qquad \text{and hence} \qquad d_i ~=~ \sqrt{\mu} |l_i| .$

Substituting this back in, it remains to solve

$V ~=~ \max_{\mu \ge 0}~ \sum_i \sqrt{\mu} |l_i| + \mu {\left( \sum_i \frac{l_i^2}{\sqrt{\mu} |l_i|} - 1 \right)} ~=~ \max_{\mu \ge 0}~ 2 \sqrt{\mu} \sum_i |l_i| - \mu .$

Optimising $$\mu$$ by equating its derivative to zero results in

$0 ~=~ \frac{1}{\sqrt{\mu}} \sum_i |l_i| - 1 \qquad \text{so that} \qquad \mu ~=~ {\Big( \sum_i |l_i| \Big)}^2 .$

We hence find optimiser and optimal value

$d_i ~=~ |l_i| \sum_j |l_j| \qquad \text{and} \qquad V ~=~ {\Big(\sum_i |l_i|\Big)}^{2} .$

Okay, that is completely explicit. So how far does this method stretch?
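The closed form above can be checked numerically: the optimiser meets the constraint $$\sum_i l_i^2/d_i \le 1$$ with equality, attains the claimed trace, and random feasible competitors (any positive vector rescaled onto the constraint boundary) never achieve a smaller trace. A minimal sketch in plain Python; the vector `l` is an arbitrary example of mine:

```python
import random

def optimal_trace_bound(l):
    """Closed-form minimiser of tr(D) over diagonal D ⪰ l lᵀ."""
    s = sum(abs(li) for li in l)
    d = [abs(li) * s for li in l]       # d_i = |l_i| * sum_j |l_j|
    return d, s * s                     # optimiser and optimal value V

random.seed(1)
l = [0.3, -1.2, 0.5, 2.0]
d, V = optimal_trace_bound(l)

# The optimiser is feasible (constraint met with equality) ...
assert abs(sum(li**2 / di for li, di in zip(l, d)) - 1) < 1e-12
# ... and attains the claimed value V = (sum_i |l_i|)^2.
assert abs(sum(d) - V) < 1e-12

# Random feasible competitors: rescaling any positive vector c by
# scale = sum_i l_i^2/c_i puts it exactly on the constraint boundary.
for _ in range(1000):
    c = [random.uniform(0.1, 3) for _ in l]
    scale = sum(li**2 / ci for li, ci in zip(l, c))
    competitor = [ci * scale for ci in c]
    assert sum(competitor) >= V - 1e-9
```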

## Schatten $$p$$-norm Objectives

The previous result is specific to the scalar objective chosen. Yet the method generalises from the trace to any Schatten $$p$$-norm as follows. Define

$V ~:=~ \min_{\substack{ {\boldsymbol{D}}\succeq {\boldsymbol{{\ell}}}{\boldsymbol{{\ell}}}^{\intercal}, \\ {\boldsymbol{D}}\text{ diagonal} }}~ {\|{\boldsymbol{D}}\|}_p$

where $${\|{\boldsymbol{D}}\|}_p$$ denotes the Schatten $$p$$-norm, that is, the vector $$p$$-norm $${\|{\boldsymbol{x}}\|}_p = (\sum_i |x_i|^p)^{1/p}$$ applied to the eigenvalues of the matrix $${\boldsymbol{D}}$$. We raise the objective to the power $$p$$ for convenience and rewrite as before

$V^p ~=~ \min_{ {\boldsymbol{d}}\ge 0 : \sum_i l_i^2/d_i \le 1 }~ \sum_i d_i^p .$

Introducing a Lagrange multiplier $$\mu$$ for the constraint, we find

$V^p ~=~ \max_{\mu \ge 0} \min_{ {\boldsymbol{d}}\ge 0 }~ \sum_i d_i^p + \mu {\left(\sum_i l_i^2/d_i - 1\right)} .$

Setting the $$d_i$$-derivative to zero results in

$d_i ~=~ {\left( \frac{ \mu l_i^2 }{ p } \right)}^{\frac{1}{p+1}}$

so that

$V^p ~=~ \max_{\mu \ge 0}~ (p^{-\frac{p}{p+1}} + p^{\frac{1}{p+1}}) \mu^{\frac{p}{p+1}} \sum_i |l_i|^{\frac{2 p}{p+1}} - \mu .$

Now optimising for $$\mu$$ results in

$\mu ~=~ p {\bigg(\sum_i |l_i|^{\frac{2 p}{p+1}}\bigg)}^{p+1}$

so that we have optimiser and value

$d_i ~=~ |l_i|^{\frac{2}{p+1}} \sum_j |l_j|^{\frac{2 p}{p+1}} \qquad \text{and} \qquad V ~=~ {\bigg( \sum_j |l_j|^{\frac{2 p}{p+1}} \bigg)}^{\frac{p+1}{p}} .$
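The general-$$p$$ closed form can be verified the same way as in the trace case: the optimiser sits exactly on the constraint boundary, its objective matches $$V^p = (\sum_i |l_i|^{2p/(p+1)})^{p+1}$$, and boundary-rescaled random competitors never improve on it. A sketch in plain Python, with example values of `l` and `p` chosen by me:

```python
import random

def schatten_optimum(l, p):
    """Closed-form minimiser of ||D||_p over diagonal D ⪰ l lᵀ (p ≥ 1)."""
    S = sum(abs(li) ** (2 * p / (p + 1)) for li in l)
    d = [abs(li) ** (2 / (p + 1)) * S for li in l]
    V = S ** ((p + 1) / p)
    return d, V

random.seed(2)
l = [0.7, -1.1, 0.4, 1.6]
for p in [1, 1.5, 2, 3, 10]:
    d, V = schatten_optimum(l, p)
    # Feasibility: the constraint sum_i l_i^2/d_i ≤ 1 holds with equality.
    assert abs(sum(li**2 / di for li, di in zip(l, d)) - 1) < 1e-9
    # The objective matches the closed form: sum_i d_i^p = V^p = S^(p+1).
    assert abs(sum(di**p for di in d) - V**p) < 1e-6 * V**p
    # Random feasible competitors (rescaled onto the boundary) never do better.
    for _ in range(300):
        c = [random.uniform(0.1, 3) for _ in l]
        scale = sum(li**2 / ci for li, ci in zip(l, c))
        assert sum((ci * scale) ** p for ci in c) >= V**p - 1e-9
```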

#### Examples

The case $$p=1$$ is called the trace norm or nuclear norm, where we find

$d_i ~=~ |l_i| \sum_j |l_j| \qquad \text{and} \qquad V ~=~ {\Big(\sum_j |l_j|\Big)}^{2} .$

The case $$p=2$$ is called the Frobenius norm, where we find

$d_i ~=~ |l_i|^{\frac{2}{3}} \sum_j |l_j|^{\frac{4}{3}} \qquad \text{and} \qquad V ~=~ {\bigg(\sum_j |l_j|^{\frac{4}{3}}\bigg)}^{3/2} .$

The case $$p=\infty$$ (maximum eigenvalue) is called the spectral norm, where we find

$d_i ~=~ \sum_j l_j^2 \qquad \text{and} \qquad V ~=~ \sum_j l_j^2 .$

Note that in this last case the $$d_i$$ are all the same, so the optimal matrix is a multiple of the identity matrix.
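The spectral-norm case can also be checked directly without any optimisation machinery: with $$c = \sum_j l_j^2$$, Cauchy–Schwarz gives $${\boldsymbol{x}}^{\intercal}(c {\boldsymbol{I}}- {\boldsymbol{{\ell}}}{\boldsymbol{{\ell}}}^{\intercal}){\boldsymbol{x}} = c \|{\boldsymbol{x}}\|^2 - ({\boldsymbol{{\ell}}}^{\intercal}{\boldsymbol{x}})^2 \ge 0$$, with equality at $${\boldsymbol{x}}= {\boldsymbol{{\ell}}}$$, so no smaller multiple of the identity dominates the dyad. A small numerical illustration (plain Python; the example vector is mine):

```python
import random

random.seed(3)
l = [0.9, -0.2, 1.3, -0.7]
c = sum(li * li for li in l)  # spectral-norm optimum: D = c * I

def quad_form(x):
    """Evaluates xᵀ (c·I - l lᵀ) x = c·‖x‖² - (lᵀx)²."""
    lx = sum(li * xi for li, xi in zip(l, x))
    return c * sum(xi * xi for xi in x) - lx * lx

# c·I ⪰ l lᵀ: the quadratic form is nonnegative on random directions ...
for _ in range(1000):
    x = [random.gauss(0, 1) for _ in l]
    assert quad_form(x) >= -1e-9
# ... and the bound is tight in the direction x = l, so no smaller
# multiple of the identity works.
assert abs(quad_form(l)) < 1e-9
```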

A case of interest is also $$p=0$$, where the objective counts the nonzero eigenvalues of $${\boldsymbol{D}}$$. The above analysis probably does not work (as $$p$$-norms switch from concave to convex at $$p=1$$), but we can still show in a different way that the final formula gives the correct answer. Indeed, assume all $$l_i \neq 0$$. Then any feasible $${\boldsymbol{D}}$$ satisfies $$d_i = {\boldsymbol{e}}_i^{\intercal}{\boldsymbol{D}}{\boldsymbol{e}}_i \ge l_i^2 > 0$$, so all $$n$$ diagonal entries are nonzero, while $$d_i = n l_i^2$$ is feasible since $$\sum_i l_i^2/d_i = 1$$. We hence find

$d_i ~=~ n l_i^2 \qquad \text{and} \qquad V ~=~ n .$
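A minimal check of the $$p=0$$ claim in plain Python (example vector mine): the candidate $$d_i = n l_i^2$$ sits exactly on the constraint boundary and has all $$n$$ diagonal entries nonzero.

```python
l = [0.5, -1.0, 2.0, 0.25]
n = len(l)
d = [n * li * li for li in l]  # candidate optimiser d_i = n * l_i^2

# Feasible: the constraint sum_i l_i^2/d_i ≤ 1 holds with equality ...
assert abs(sum(li * li / di for li, di in zip(l, d)) - 1) < 1e-12
# ... and all n diagonal entries are nonzero, so the 0-"norm" equals n.
assert sum(1 for di in d if di != 0) == n
```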