For a differentiable convex function \(\phi : \mathbb R \to \mathbb R\) the Bregman divergence from \(x\) to \(y\) is defined as \[\triangle_\phi(x\|y) ~=~ \phi(x) - \phi(y) - (x-y) \phi'(y) .\] Bregman divergences are very popular and prevalent in machine learning. In this post we think about what happens if we start with \(\phi\) convex but not differentiable. In particular we look at the simplest such function, the absolute value function. One way to approach this question is via the selection of subgradients; here we instead study it by means of limits of natural families of convex functions.
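As a quick illustration of the definition, here is a minimal Python sketch (the names `bregman`, `phi`, `dphi` are just for this example). With \(\phi(x) = x^2\) it recovers the familiar squared distance \((x-y)^2\):

```python
# A minimal sketch of the Bregman divergence induced by a differentiable
# convex function phi, given phi and its derivative dphi.
def bregman(phi, dphi, x, y):
    """Bregman divergence from x to y: phi(x) - phi(y) - (x - y) phi'(y)."""
    return phi(x) - phi(y) - (x - y) * dphi(y)

# Sanity check with phi(x) = x^2: the divergence is (x - y)^2.
assert bregman(lambda t: t**2, lambda t: 2*t, 3.0, 1.0) == (3.0 - 1.0)**2
```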

We consider two families of differentiable convex functions. First, the \(p\)th power for \(p > 1\), \[F_p(x) ~=~ |x|^p\] which gives \[\triangle_{F_p}(x\|y) ~=~ |x|^{p} - \operatorname{sgn}(y) \big( p x + (1-p) y \big) |y|^{p-1}\] and second, the log-cosh function (recall that \(\sinh(\theta) = \frac{e^{\theta} - e^{-\theta}}{2}\), \(\cosh(\theta) = \frac{e^{\theta} + e^{-\theta}}{2}\) and \(\tanh(\theta) = \frac{\sinh(\theta)}{\cosh(\theta)}\)) \[G_\eta(x) ~=~ \frac{1}{\eta} \ln (\cosh (\eta x))\] which gives \[\triangle_{G_\eta}(x\|y) ~=~ \frac{1}{\eta } \ln \frac{ \cosh (\eta x) }{ \cosh (\eta y) } + (y-x) \tanh (\eta y) .\] Here is a graph showing both \(F_p\) and \(G_\eta\) (this is both the graph of \(F_p(x)\) and \(G_\eta(x)\) and of \(\triangle_{F_p}(x\|0)\) and \(\triangle_{G_\eta}(x\|0)\)):
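As a sanity check, the two closed forms above can be coded up directly and compared against the defining formula \(\phi(x) - \phi(y) - (x-y)\phi'(y)\), using \(F_p'(y) = p \operatorname{sgn}(y) |y|^{p-1}\). This is just an illustrative Python sketch, not part of the derivation:

```python
import math

def div_Fp(x, y, p):
    # Bregman divergence of F_p(x) = |x|^p, for p > 1 and y != 0.
    return abs(x)**p - math.copysign(1, y) * (p*x + (1 - p)*y) * abs(y)**(p - 1)

def div_G(x, y, eta):
    # Bregman divergence of G_eta(x) = ln(cosh(eta x)) / eta.
    return (math.log(math.cosh(eta*x)) - math.log(math.cosh(eta*y))) / eta \
           + (y - x) * math.tanh(eta*y)

# Cross-check the rearranged closed form against the definition,
# with F_p'(y) = p * sgn(y) * |y|^(p-1).
x, y, p = 2.0, -0.5, 1.5
direct = abs(x)**p - abs(y)**p - (x - y) * p * math.copysign(1, y) * abs(y)**(p - 1)
assert abs(div_Fp(x, y, p) - direct) < 1e-12
```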

Now by design, we can take the limit of \(p \to 1\) or \(\eta \to \infty\), pointwise, and we end up with the absolute value function: \[H(x) ~:=~ |x| ~=~ \begin{cases} \lim_{p \to 1} F_p(x) \\ \lim_{\eta \to \infty} G_\eta(x) \end{cases}\] The absolute value function is convex, but not differentiable. Yet, by the same pointwise limit construction we can associate a perhaps natural Bregman divergence to it, namely \[\triangle_H(x\|y) ~:=~ \begin{cases} \lim_{p \to 1} \triangle_{F_p}(x\|y) \\ \lim_{\eta \to \infty} \triangle_{G_\eta}(x\|y) \end{cases} ~=~ |x| - \operatorname{sgn}(y) x .\] Again the limit coincides for both starting points. This divergence is interesting: it depends on \(y\) only through its sign. We may rewrite it as follows \[\triangle_H(x\|y) ~=~ \begin{cases} 2 x_-& y > 0 \\ 2 x_+ & y < 0 \\ |x| & y = 0 \end{cases}\] where \(x_+ = \max\{0,x\}\), \(x_- = \max\{0,-x\}\) so that \(|x| = x_- + x_+\). It is a nice convex function in \(x\), but discontinuous in \(y\) at \(y=0\). Now if \(y \in \{\pm 1\}\), we find \[\triangle_H(x\|y) ~=~ 2 \max \{0, -y x\}\] which can be thought of as a zero-margin hinge loss.
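The pointwise convergence can also be checked numerically. The following illustrative Python sketch (helper names are mine) evaluates both families close to their limits and compares against \(|x| - \operatorname{sgn}(y) x\):

```python
import math

def div_Fp(x, y, p):
    # Bregman divergence of F_p(x) = |x|^p, for p > 1 and y != 0.
    return abs(x)**p - math.copysign(1, y) * (p*x + (1 - p)*y) * abs(y)**(p - 1)

def div_G(x, y, eta):
    # Bregman divergence of G_eta(x) = ln(cosh(eta x)) / eta.
    return (math.log(math.cosh(eta*x)) - math.log(math.cosh(eta*y))) / eta \
           + (y - x) * math.tanh(eta*y)

def div_H(x, y):
    # Limiting divergence |x| - sgn(y) x, with sgn(0) = 0.
    return abs(x) - math.copysign(1, y) * x if y != 0 else abs(x)

# Both families approach the same limit as p -> 1 and eta -> infinity.
x, y = -1.3, 0.7
limit = div_H(x, y)
assert abs(div_Fp(x, y, 1.0001) - limit) < 1e-2
assert abs(div_G(x, y, 50.0) - limit) < 1e-2
```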

One of the powerful properties of Bregman Divergences is that the expected divergence from a random vector \(X\) is minimized by the mean. That is \[\operatorname*{argmin}_y\,
\operatorname*{\mathbb E}[ \triangle_\phi(X\|y)]
~=~
\operatorname*{\mathbb E}[X]
.\] In this sense Bregman divergences are *proper*. Let us check whether this is still true for the absolute value divergence. We have \[\operatorname*{\mathbb E}[ \triangle_H(X\|y)]
~=~
\begin{cases}
2 \operatorname*{\mathbb E}[ X_-]& y > 0
\\
2 \operatorname*{\mathbb E}[X_+ ]& y < 0
\\
\operatorname*{\mathbb E}[ |X|] & y = 0
\end{cases}\] Now indeed, since \(\operatorname*{\mathbb E}[|X|] = \operatorname*{\mathbb E}[X_+] + \operatorname*{\mathbb E}[X_-]\) and \(\operatorname*{\mathbb E}[X] = \operatorname*{\mathbb E}[X_+] - \operatorname*{\mathbb E}[X_-]\), the smallest of the three cases is the one whose sign matches that of \(\operatorname*{\mathbb E}[X]\), so the expectation is minimized by \(y = \operatorname*{\mathbb E}[X]\). Of course it is also minimized by any other value of \(y\) with the same sign, but it is still cute that propriety is preserved non-strictly.
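This can be confirmed with a quick Monte Carlo experiment; the Python sketch below is purely illustrative, with Gaussian samples of positive mean chosen just as an example:

```python
import random

def div_H(x, y):
    # Limiting divergence |x| - sgn(y) x, with sgn(0) = 0.
    if y > 0:
        return abs(x) - x   # = 2 * max(0, -x) = 2 x_-
    if y < 0:
        return abs(x) + x   # = 2 * max(0,  x) = 2 x_+
    return abs(x)

random.seed(0)
xs = [random.gauss(0.5, 1.0) for _ in range(100_000)]  # E[X] ~ 0.5 > 0
mean = sum(xs) / len(xs)

def risk(y):
    # Empirical estimate of E[div_H(X || y)].
    return sum(div_H(x, y) for x in xs) / len(xs)

# The expected divergence is minimised by any y sharing the sign of E[X],
# in particular by the mean itself.
assert risk(mean) <= risk(-1.0) and risk(mean) <= risk(0.0)
```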