2022-10-27

Abstract

We look at the value of a two-player zero-sum game as a function of its payoff matrix. We see that even if that matrix is a smooth function of some parameter, the value of the game is not differentiable.

I was surprised today by the following. Consider a \(K \times M\) matrix, \(A(\lambda)\) where each entry is given by a
nice (say infinitely differentiable) function \(A_{i,j}(\lambda)\). Let us interpret \(A(\lambda)\) as the payoff matrix of a
two-player zero-sum game, and let us investigate the value function
\[V(\lambda) ~=~ \min_{p \in \triangle_K}
\max_{q \in \triangle_M}~ \sum_{i,j} p_i q_j A_{i,j}(\lambda).\]
Here, we ask if \(V(\lambda)\) is a
nice function of \(\lambda\). We will
see that it is continuous but not differentiable. Not even when the
\(A_{i,j}(\lambda)\) are
*linear* in \(\lambda\).

In the below figure we investigate the situation for \[A(\lambda) ~=~ \lambda \begin{pmatrix} 0.13 & 0.21 & 2.75 & -0.63\\ 1.45 & -2.0 & -1.02 & 0.56\\ -0.52 & -1.7 & -1.36 & -1.5 \end{pmatrix} + (1-\lambda) \begin{pmatrix} -0.83 & 1.24 & 0.25 & 0.98\\ 1.29 & 1.47 & -1.05 & 0.1\\ 1.45 & 2.22 & 1.47 & -1.29 \end{pmatrix}\] These coefficients were drawn at random until a pleasing plot appeared. The fact that this only took about 10 resamplings indicates that the situation illustrated by the plot below is not rare at all.

And just for good measure, here is a second example where the value
function has a *local* minimum (which in fact consists of two
local minima straddling a local maximum).

We see that even when the payoffs are each linear functions of a scalar parameter \(\lambda\), the value of the game is not differentiable. Is that surprising? Maybe. How about the following? Let \(p^*(\lambda)\) and \(q^*(\lambda)\) be the saddle point for \(A(\lambda)\) (which is indeed typically unique). Then by the rule for differentiating optima and by the chain rule, \begin{align*} \nabla_\lambda V(\lambda) &~=~ \nabla_\lambda \left( p^*(\lambda)^\intercal A(\lambda) q^*(\lambda) \right) \\ &~=~ (\nabla_\lambda p^*(\lambda))^\intercal A(\lambda) q^*(\lambda) + p^*(\lambda)^\intercal(\nabla_\lambda A(\lambda)) q^*(\lambda) + p^*(\lambda)^\intercal A(\lambda) (\nabla_\lambda q^*(\lambda)) \\ &~=~ p^*(\lambda)^\intercal(\nabla_\lambda A(\lambda)) q^*(\lambda) \end{align*} The first and last contribution are in fact zero. Why? Because optimality of \(p\) in \(V(p,A,q)\) requires \(V^{(1)}(p,a,q) = 0\). It is actually a bit more complicated, as \(p(\lambda)\) sums to one, and hence \(\nabla_\lambda p(\lambda)\) sums to zero. Moreover, \(A(\lambda)q^*(\lambda)\) is a constant. So the non-differentiability in the derivative \(\nabla_\lambda V(\lambda)\) is not due to nondifferentiability of the entries in \(A(\lambda)\). It is due to the optimal weights jumping. This can clearly be seen in the first figure. For example, the jump up in the derivative at \(\lambda = 0.45\) is caused by a jump in \(q^*(\lambda)\).

Why is all this interesting or relevant? Well, for example because when training GANs one optimises the parameters of an objective that is of min-max shape. As illustrated above, such objectives are complex to optimise. Even in one dimension.