Tight Diagonal Upper Bounds on Dyads

Wouter M. Koolen



We look at bounding a symmetric rank one matrix from above by a diagonal matrix.


The topic of this post came up during a discussion with Tim van Erven. Sometimes it is important to bound a matrix form above by a diagonal matrix. For example, if the original matrix describes covariance structure between features then a diagonal matrix corresponds to independence. In this post we obtain a diagonal upper bound for a symmetric rank one matrix \(\boldsymbol{\ell}\boldsymbol{\ell}^\intercal\) (a matrix of this form is sometimes called a dyad). So we are looking for a diagonal matrix \(\boldsymbol{D}\) such that \[\boldsymbol{D} ~\succeq~ \boldsymbol{\ell}\boldsymbol{\ell}^\intercal\] where \(\succeq\) denotes the Löwner order (\(\boldsymbol{A}\succeq \boldsymbol{B}\) if \(\boldsymbol{A}- \boldsymbol{B}\) is positive semi-definite). Now there are many matrices \(\boldsymbol{D}\) satisfying the above requirement. We want to find the best one. To define optimality we need to put forward a scalar-valued objective. Here we consider minimising the trace of \(\boldsymbol{D}\). So we are looking for \[V ~:=~ \min_{\substack{ \boldsymbol{D}\succeq \boldsymbol{\ell}\boldsymbol{\ell}^\intercal, \\ \text{$\boldsymbol{D}$ diagonal} }}~ \mathop{\mathrm{tr}}(\boldsymbol{D}) .\] As a first step, let us reparametrise by \(\boldsymbol{D}= \mathop{\mathrm{diag}}(\boldsymbol{d})\). Then we are looking for \[V ~=~ \min_{ \mathop{\mathrm{diag}}(\boldsymbol{d}) \succeq \boldsymbol{\ell}\boldsymbol{\ell}^\intercal }~ \sum_i d_i .\] Now how to deal with the matrix inequality? For \(\boldsymbol{D}\succeq \boldsymbol{0}\) we have \(\boldsymbol{D}\succeq \boldsymbol{\ell}\boldsymbol{\ell}^\intercal\) iff \(\boldsymbol{\ell}^\intercal\boldsymbol{D}^{-1} \boldsymbol{\ell}\le 1\). This follows from reasoning about the Schur complements of the matrix \(\begin{bmatrix} \boldsymbol{D}& \boldsymbol{\ell}\\ \boldsymbol{\ell}^\intercal& 1 \end{bmatrix}\). Furthermore, inverting a diagonal matrix simply inverts its diagonal elements. So we need to solve \[V ~=~ \min_{ \boldsymbol{d}\ge 0 : \sum_i l_i^2/d_i \le 1 %\l^\top \diag(\d^{-1}) \l \le 1 }~ \sum_i d_i .\] Introducing a Lagrange multiplier \(\mu\) for the constraint, we find \[V ~=~ \max_{\mu \ge 0} \min_{ \boldsymbol{d}\ge 0 %\l^\top \diag(\d^{-1}) \l \le 1 }~ \sum_i d_i + \mu \left( \sum_i l_i^2/d_i - 1 \right) .\] We solve the innermost optimisation by setting the \(d_i\)-derivative to zero. We find \[0 ~=~ 1 - \mu l_i^2/d_i^2 \qquad \text{and hence} \qquad d_i ~=~ \sqrt{\mu} |l_i|\] Substituting this back in, it hence remains to solve \[V ~=~ \max_{\mu \ge 0}~ \sum_i \sqrt{\mu} |l_i| + \mu \left( \sum_i \frac{l_i^2}{\sqrt{\mu} |l_i|} - 1 \right) ~=~ \max_{\mu \ge 0}~ 2 \sqrt{\mu} \sum_i |l_i| - \mu .\] Optimising \(\mu\) by equating its derivative to zero results in \[0 ~=~ \frac{1}{\sqrt{\mu}} \sum_i |l_i| - 1 \qquad \text{so that} \qquad \mu ~=~ \Big( \sum_i |l_i| \Big)^2 .\] We hence find optimiser and optimal value \[d_i ~=~ |l_i| \sum_j |l_j| \qquad \text{and} \qquad V ~=~ \Big(\sum_i |l_i|\Big)^2 .\] Okay, that is completely explicit. So how far does this method stretch?

Schatten \(p\)-norm Objectives

The previous result is specific to the scalar objective chosen. Yet the method generalises from the trace to any Schatten \(p\)-norm as follows. Define \[V ~:=~ \min_{\substack{ \boldsymbol{D}\succeq \boldsymbol{\ell}\boldsymbol{\ell}^\intercal, \\ \text{$\boldsymbol{D}$ diagonal} }}~ \|\boldsymbol{D}\|_p\] where \(\|\boldsymbol{D}\|_p\) denotes the Schatten \(p\)-norm, that is, the vector \(p\)-norm \(\|\boldsymbol{\ell}\|_p = (\sum_i |l_i|^p)^{1/p}\) applied to the eigenvalues of the matrix \(\boldsymbol{D}\). We raise the objective to the power \(p\) for convenience and rewrite as before \[V^p ~=~ \min_{ \boldsymbol{d}\ge 0 : \sum_i l_i^2/d_i \le 1 }~ \sum_i d_i^p .\] Introducing Lagrange multiplier \(\mu\) for the constraint, we find \[V^p ~=~ \max_{\mu \ge 0} \min_{ \boldsymbol{d}\ge 0 }~ \sum_i d_i^p + \mu \left(\sum_i l_i^2/d_i - 1\right) .\] Setting the derivative to \(0\) results in \[d_i ~=~ \left( \frac{ \mu l_i^2 }{ p } \right)^{\frac{1}{p+1}}\] so that \[V^p ~=~ \max_{\mu \ge 0}~ (p^{-\frac{p}{p+1}} + p^{\frac{1}{p+1}}) \mu^{\frac{p}{p+1}} \sum_i l_i^{\frac{2 p}{p+1}} - \mu .\] Now optimising for \(\mu\) results in \[\mu ~=~ p \bigg(\sum_i l_i^{\frac{2 p}{p+1}}\bigg)^{p+1}\] so that we have optimiser and value \[d_i ~=~ l_i^{\frac{2}{p+1}} \sum_j l_j^{\frac{2 p}{p+1}} \qquad \text{and} \qquad V ~=~ \bigg( \sum_j l_j^{\frac{2 p}{p+1}} \bigg)^{\frac{p+1}{p}} .\]


The case \(p=1\) is called the trace norm or nuclear norm, where we find \[d_i ~=~ |l_i| \sum_j |l_j| \qquad \text{and} \qquad V ~=~ \Big(\sum_j |l_j|\Big)^{2} .\] The case \(p=2\) is called the Frobenius norm where we find \[d_i ~=~ l_i^{\frac{2}{3}} \sum_j l_j^{\frac{4}{3}} \qquad \text{and} \qquad V ~=~ \bigg(\sum_j l_j^{\frac{4}{3}}\bigg)^{3/2} .\] The case \(p=\infty\) (maximum eigenvalue) is called the spectral norm where we find \[d_i ~=~ \sum_j l_j^2 \qquad \text{and} \qquad V ~=~ \sum_j l_j^2 .\] Note that in this last case the \(d_i\) are all the same, so the optimal matrix is a multiple of the identity matrix.

A case of interest is also \(p=0\). The above analysis probably does not work (as \(p\)-norms switch from concave to convex at \(p=1\)), but we can still show in a different way that the final formula gives the correct answer, i.e. we find \[d_i ~=~ n l_i^2 \qquad \text{and} \qquad V ~=~ n .\]