Introduction

We start by recalling the (Shannon) relative entropy and its duals. The relative entropy between two nonnegative (element-wise) vectors \({\boldsymbol{w}}\) and \({\boldsymbol{u}}\) is defined by \[\operatorname{\triangle}{\left({\boldsymbol{w}}\middle\|{\boldsymbol{u}}\right)} ~=~ \sum_k {\left(w_k \ln \frac{w_k}{u_k} + u_k - w_k\right)} .\] Relative entropy (aka Kullback-Leibler divergence) is the Bregman divergence generated by the convex function \({\boldsymbol{w}}\mapsto \sum_k w_k \ln w_k\). When considering probability vectors \({\boldsymbol{w}}\) and \({\boldsymbol{u}}\) (nonnegative vectors with unit sum) the last two terms cancel in the sum and are typically omitted. Nevertheless, the relative entropy is well-defined on all nonnegative vectors.
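To make the definition concrete, here is a small numerical sketch (the helper name `rel_entropy` is ours, not a standard API):

```python
import numpy as np

def rel_entropy(w, u):
    """Shannon relative entropy between nonnegative vectors w and u."""
    return np.sum(w * np.log(w / u) + u - w)

w = np.array([0.2, 0.5, 0.3])
u = np.array([0.1, 0.6, 0.3])
print(rel_entropy(w, u))  # nonnegative, and zero exactly when w == u
print(rel_entropy(u, u))  # → 0.0
```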

Let \({\boldsymbol{l}}\) be any vector. We look at two duals of relative entropy. First, we consider the dual on the space of nonnegative vectors (the domain of the relative entropy). Let \({\boldsymbol{u}}\) be a nonnegative vector. Then \begin{equation}\label{eq:Shan.full} \min_{{\boldsymbol{w}}}~ \operatorname{\triangle}{\left({\boldsymbol{w}}\middle\|{\boldsymbol{u}}\right)} + {\boldsymbol{w}}^{\intercal}{\boldsymbol{l}}~=~ \sum_k u_k {\left(1 - e^{- l_k}\right)} . \end{equation} Second, we consider the dual on the space of probability vectors. If \({\boldsymbol{u}}\) is a probability vector then \begin{equation}\label{eq:Shan.unit} \min_{{\boldsymbol{w}}: {\boldsymbol{1}}^{\intercal}{\boldsymbol{w}}= 1}~ \operatorname{\triangle}{\left({\boldsymbol{w}}\middle\|{\boldsymbol{u}}\right)} + {\boldsymbol{w}}^{\intercal}{\boldsymbol{l}}~=~ - \ln \sum_k u_k e^{ - l_k} . \end{equation} If you have not seen these before, both are fun exercises.
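Both identities can be checked numerically by plugging in the known minimisers, \(w_k = u_k e^{-l_k}\) for the unconstrained problem and \(w_k \propto u_k e^{-l_k}\) on the simplex. A quick sketch (`rel_entropy` is our own helper):

```python
import numpy as np

def rel_entropy(w, u):
    # Shannon relative entropy between nonnegative vectors
    return np.sum(w * np.log(w / u) + u - w)

rng = np.random.default_rng(0)
u = rng.uniform(0.1, 1.0, size=4)
l = rng.normal(size=4)

# Unconstrained dual: the minimiser is w = u * exp(-l)
w1 = u * np.exp(-l)
lhs1 = rel_entropy(w1, u) + w1 @ l
rhs1 = np.sum(u * (1 - np.exp(-l)))

# Dual over probability vectors: the minimiser is w ∝ u * exp(-l)
u = u / u.sum()                      # normalise u to a probability vector
Z = np.sum(u * np.exp(-l))
w2 = u * np.exp(-l) / Z
lhs2 = rel_entropy(w2, u) + w2 @ l
rhs2 = -np.log(Z)

print(np.isclose(lhs1, rhs1), np.isclose(lhs2, rhs2))  # True True
```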

von Neumann relative entropy

We now consider von Neumann’s extension of the Shannon relative entropy to positive semi-definite (PSD) matrices. PSD matrices are real-valued, symmetric and have nonnegative eigenvalues. The von Neumann relative entropy is defined by \[\operatorname{\triangle}{\left({\boldsymbol{W}}\middle\|{\boldsymbol{U}}\right)} ~=~ \operatorname{tr}{\left( {\boldsymbol{W}}\ln {\boldsymbol{W}}- {\boldsymbol{W}}\ln {\boldsymbol{U}}- {\boldsymbol{W}}+ {\boldsymbol{U}}\right)} .\] Recall that the logarithm and exponential of a symmetric matrix are obtained by applying the corresponding operation to its eigenvalues, and that the trace is the sum of the eigenvalues. von Neumann relative entropy is the Bregman divergence generated by the convex function \({\boldsymbol{W}}\mapsto \operatorname{tr}{\left({\boldsymbol{W}}\ln {\boldsymbol{W}}\right)}\). When considering density matrices (PSD matrices of unit trace) the last two terms above cancel and are typically omitted. von Neumann relative entropy reduces to Shannon relative entropy for diagonal matrices.
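As a sanity check on the reduction to the diagonal case, one can evaluate the definition with SciPy's matrix logarithm (`vn_rel_entropy` is our own helper, and we assume strictly positive eigenvalues so the logarithms exist):

```python
import numpy as np
from scipy.linalg import logm

def vn_rel_entropy(W, U):
    """von Neumann relative entropy tr(W ln W - W ln U - W + U),
    assuming symmetric matrices with strictly positive eigenvalues."""
    return np.trace(W @ logm(W) - W @ logm(U) - W + U).real

# On diagonal matrices it reduces to the Shannon relative entropy.
w = np.array([0.2, 0.5, 0.3])
u = np.array([0.1, 0.6, 0.3])
shannon = np.sum(w * np.log(w / u) + u - w)
print(np.isclose(vn_rel_entropy(np.diag(w), np.diag(u)), shannon))  # True
```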

In this post we compute the dual of the von Neumann relative entropy. As for Shannon above, we do this both on the full domain and on the set of density matrices.

Unconstrained dual

Fix PSD matrix \({\boldsymbol{U}}\) and symmetric matrix \({\boldsymbol{L}}\). We aim to compute \[\min_{{\boldsymbol{W}}}~ \operatorname{\triangle}{\left({\boldsymbol{W}}\middle\|{\boldsymbol{U}}\right)} + \operatorname{tr}{\left({\boldsymbol{W}}{\boldsymbol{L}}\right)}\] where \({\boldsymbol{W}}\) ranges over PSD matrices (the domain of the relative entropy). To compute the minimiser, we set the derivative to zero. That is, we require \({\boldsymbol{0}}= \ln {\boldsymbol{W}}- \ln {\boldsymbol{U}}+ {\boldsymbol{L}}\) which resolves to \[{\boldsymbol{W}}~=~ \exp{\left(\ln {\boldsymbol{U}}- {\boldsymbol{L}}\right)} .\] If \({\boldsymbol{U}}\) and \({\boldsymbol{L}}\) commute we may simplify this to \({\boldsymbol{W}}= {\boldsymbol{U}}\exp {\left(-{\boldsymbol{L}}\right)}\), but they typically do not. The value is given by \[\min_{{\boldsymbol{W}}}~ \operatorname{\triangle}{\left({\boldsymbol{W}}\middle\|{\boldsymbol{U}}\right)} + \operatorname{tr}{\left({\boldsymbol{W}}{\boldsymbol{L}}\right)} ~=~ \operatorname{tr}{\left( {\boldsymbol{U}}- \exp{\left(\ln {\boldsymbol{U}}- {\boldsymbol{L}}\right)} \right)} .\] Comparing to \(\eqref{eq:Shan.full}\), we see that this is very similar. Yet in the interesting case that \({\boldsymbol{U}}\) and \({\boldsymbol{L}}\) do not commute, we see that their interaction is more complicated, and \({\boldsymbol{U}}\) cannot be factored out.
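The claimed minimiser and value can be verified numerically with `scipy.linalg.expm` and `logm` on random matrices (a sketch; the helper `vn_rel_entropy` is our own name for the definition above):

```python
import numpy as np
from scipy.linalg import expm, logm

rng = np.random.default_rng(1)
n = 3
A = rng.normal(size=(n, n))
U = A @ A.T + 0.1 * np.eye(n)        # random positive definite U
B = rng.normal(size=(n, n))
L = (B + B.T) / 2                    # random symmetric L

def vn_rel_entropy(W, U):
    return np.trace(W @ logm(W) - W @ logm(U) - W + U).real

# Claimed minimiser W = exp(ln U - L) and the claimed optimal value.
W_star = expm(logm(U) - L)
lhs = vn_rel_entropy(W_star, U) + np.trace(W_star @ L)
rhs = np.trace(U - W_star)
print(np.isclose(lhs, rhs))  # True
```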

Dual w.r.t. the set of density matrices

Fix a symmetric matrix \({\boldsymbol{L}}\) and a density matrix \({\boldsymbol{U}}\) (meaning \(\operatorname{tr}{\boldsymbol{U}}= 1\)). We study the problem \[\min_{{\boldsymbol{W}}: \operatorname{tr}{\boldsymbol{W}}= 1}~ \operatorname{\triangle}{\left({\boldsymbol{W}}\middle\|{\boldsymbol{U}}\right)} + \operatorname{tr}{\left({\boldsymbol{W}}{\boldsymbol{L}}\right)}\] where \({\boldsymbol{W}}\) ranges over all density matrices. Introducing the Lagrange multiplier \(\lambda\) for the unit trace constraint, and subsequently equating the derivative to zero results in \({\boldsymbol{0}}= \ln {\boldsymbol{W}}- \ln {\boldsymbol{U}}+ {\boldsymbol{L}}- \lambda {\boldsymbol{I}}\). We hence find that \({\boldsymbol{W}}= e^\lambda \exp {\left(\ln {\boldsymbol{U}}- {\boldsymbol{L}}\right)}\) and upon setting \(\lambda\) to enforce the normalisation we find minimiser \[{\boldsymbol{W}}~=~ \frac{\exp {\left(\ln {\boldsymbol{U}}- {\boldsymbol{L}}\right)}}{\operatorname{tr}\exp {\left(\ln {\boldsymbol{U}}- {\boldsymbol{L}}\right)}} ,\] and objective value \begin{align*} & \min_{{\boldsymbol{W}}: \operatorname{tr}{\boldsymbol{W}}= 1}~ \operatorname{\triangle}{\left({\boldsymbol{W}}\middle\|{\boldsymbol{U}}\right)} + \operatorname{tr}{\left({\boldsymbol{W}}{\boldsymbol{L}}\right)} \\ &~=~ \operatorname{tr}{\left( {\boldsymbol{W}}\ln \frac{\exp {\left(\ln {\boldsymbol{U}}- {\boldsymbol{L}}\right)}}{\operatorname{tr}\exp {\left(\ln {\boldsymbol{U}}- {\boldsymbol{L}}\right)}} - {\boldsymbol{W}}\ln {\boldsymbol{U}}+ {\boldsymbol{W}}{\boldsymbol{L}}\right)} \\ &~=~ - \ln \operatorname{tr}\exp {\left(\ln {\boldsymbol{U}}- {\boldsymbol{L}}\right)} . \end{align*} Comparing to \(\eqref{eq:Shan.unit}\), we see that this is again very similar. Yet in the interesting case that \({\boldsymbol{U}}\) and \({\boldsymbol{L}}\) do not commute, we see that their interaction is more complicated. 
The “taking an average under \({\boldsymbol{u}}\)” flavour is not preserved for matrices; now all the action resides in the exponent.
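Again this is easy to check numerically by evaluating the objective at the claimed minimiser (a sketch on random matrices; `vn_rel_entropy` is our own helper):

```python
import numpy as np
from scipy.linalg import expm, logm

rng = np.random.default_rng(2)
n = 3
A = rng.normal(size=(n, n))
U = A @ A.T + 0.1 * np.eye(n)
U = U / np.trace(U)                  # density matrix: PSD with unit trace
B = rng.normal(size=(n, n))
L = (B + B.T) / 2                    # symmetric loss matrix

def vn_rel_entropy(W, U):
    return np.trace(W @ logm(W) - W @ logm(U) - W + U).real

M = expm(logm(U) - L)
W_star = M / np.trace(M)             # claimed minimiser over density matrices
lhs = vn_rel_entropy(W_star, U) + np.trace(W_star @ L)
rhs = -np.log(np.trace(M))
print(np.isclose(lhs, rhs))  # True
```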

A forced commute

Finally, we compare the actual duals to the expressions that we would obtain if \({\boldsymbol{U}}\) and \({\boldsymbol{L}}\) commuted, and conclude that the latter are never larger. The key step is the Golden-Thompson inequality \(\operatorname{tr}{\left(e^{{\boldsymbol{A}}+{\boldsymbol{B}}}\right)} \le \operatorname{tr}{\left(e^{{\boldsymbol{A}}} e^{{\boldsymbol{B}}}\right)}\) for any symmetric \({\boldsymbol{A}}\) and \({\boldsymbol{B}}\). Applying it with \({\boldsymbol{A}}= \ln {\boldsymbol{U}}\) and \({\boldsymbol{B}}= -{\boldsymbol{L}}\) gives \(\operatorname{tr}\exp{\left(\ln {\boldsymbol{U}}- {\boldsymbol{L}}\right)} \le \operatorname{tr}{\left({\boldsymbol{U}}\exp{\left(-{\boldsymbol{L}}\right)}\right)}\). So \[\operatorname{tr}{\left( {\boldsymbol{U}}- \exp{\left(\ln {\boldsymbol{U}}- {\boldsymbol{L}}\right)} \right)} ~\ge~ \operatorname{tr}{\left( {\boldsymbol{U}}{\left({\boldsymbol{I}}- \exp{\left(-{\boldsymbol{L}}\right)}\right)} \right)} ,\] and \[- \ln \operatorname{tr}\exp {\left(\ln {\boldsymbol{U}}- {\boldsymbol{L}}\right)} ~\ge~ - \ln \operatorname{tr}{\left({\boldsymbol{U}}\exp{\left( - {\boldsymbol{L}}\right)}\right)} .\]
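Golden-Thompson itself is easy to probe numerically on random symmetric matrices (a sketch; equality holds when the two matrices commute, e.g. when they are equal):

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(3)
n = 4
X, Y = rng.normal(size=(n, n)), rng.normal(size=(n, n))
A, B = (X + X.T) / 2, (Y + Y.T) / 2   # random symmetric matrices

gt_lhs = np.trace(expm(A + B))
gt_rhs = np.trace(expm(A) @ expm(B))
print(gt_lhs <= gt_rhs)               # Golden-Thompson: True

# Equality in the commuting case, e.g. B = A.
print(np.isclose(np.trace(expm(A + A)), np.trace(expm(A) @ expm(A))))  # True
```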