1-Wasserstein distance: Kantorovich–Rubinstein duality

The Kantorovich–Rubinstein distance, popularly known to the machine learning community as the Wasserstein distance, is a metric to compute the distance between two probability measures. The 1-Wasserstein is the most common variant of the Wasserstein distances (thanks to WGAN and its variants). Its dual form is generally used in an adversarial setup which is defined as

where $\mu$ and $\nu$ are probability measures, and $\mathrm{Lip}(f)$ denotes the Lipschitz constant of the function $f$. The function $f$ is realized using a neural network and the Lipschitz constraint is enforced using various techniques such as weight-clipping, gradient penalty, and spectral normalization. The neural network is trained to maximize the integral $\int fd\mu(x) - \int fd\nu(y)$.

Eq. ($\ref{eq:w1dual}$) is the dual form of the Wasserstein distance which has the general form

where $d(x,y)$ is a distance metric between two points $x$ and $y$, and $\gamma$ is a coupling of the probability measures $\mu$ and $\nu$. Without going into too much jargon, a coupling $\gamma$ can be thought of as a joint distribution that belongs to the set of joint distributions $\Gamma(\mu, \nu)$ that have marginals $\mu$ and $\nu$. For the remainder of this note, we’ll set $d(x,y) = |x-y|$ and $p=1$.

Generally in machine learning literature, the dual form and the Lipschitz constraints are directly introduced and there’s a lack of resources on the internet that address these basics. To this end, in this post I will explain how to arrive at the dual form (Eq. $\ref{eq:w1dual}$) from the primal form:

The treatment in this post will be a non-rigorous one. Please refer to [1] for a rigorous proof.

Optimal Transport Map
Figure 1. Optimal transport map $T$ from $\mu$ to $\nu$.

Kantorovich Duality

We first remove the $\gamma \in \Gamma(\mu, \nu)$ constraint from Eq. ($\ref{eq:w1primal}$) and add it as a supremum instead.

where $f$ and $g$ are absolutely integrable functions w.r.t. $\mu$ and $\nu$ respectively. Eq. ($\ref{eq:derivstep1}$) encodes the $\gamma \in \Gamma(\mu, \nu)$ constraint because the supremum is $0$ when $\gamma \in \Gamma(\mu, \nu)$ and is $\infty$ otherwise. This is because

and would cancel the other terms. In any other case, $f$ and $g$ can be suitably chosen such that the supremum becomes $\infty$.

We can now take the $\sup_{f,g}$ outside because the first term does not depend on $f$ and $g$,

We can now invoke the minimax principle to replace the $\inf\sup$ with a $\sup\inf$ under certain conditions which are beyond the scope of this post.

When $|x-y| \geq f(x)+g(y)$, the value of $\inf_{\gamma} \int (|x-y|-f(x)+g(y))d\gamma(x,y)$ is $0$ and is $-\infty$ otherwise. This can be added as a constraint in the equation as follows

$c$-transform

How do we now find such $f$ and $g$ the maximize the r.h.s. of Eq. ($\ref{eq:w1dualgeneral}$)?

Let’s assume that we have a function $f$ and we want to find the optimal $g$ corresponding to $f$ that achieves the supremum in Eq. ($\ref{eq:w1dualgeneral}$). We know that $\forall x,y \, f(x) + g(y) \leq |x-y|$. We can write this as follows

because if $g(y) \leq |x-y| - f(x) \, \forall x,y$, then it must be true for the $x$ that minimizes the r.h.s. (which also makes sure that it is true for all other $x$s). Since, $g(y) \leq \inf_{x} |x-y| - f(x)$, the best we can do to maximize the r.h.s. in Eq. ($\ref{eq:w1dualgeneral}$) is set

.

Eq. ($\ref{eq:ctransform1}$) gives us a function which is called the $c$-transform of $f$ and is often denoted by $f^c$,

It can be shown that $f^{cc} = f$. We can now write Eq. ($\ref{eq:w1dualgeneral}$) as

Special case for cost $| x-y |$

The above duality holds for any arbitrary cost $c(x,y)$, not just for $c(x,y)=|x-y|$. In the case of $|x-y|$, let’s derive the form of the dual problem (Eq. $\ref{eq:w1dualgeneral}$) when $f$ is 1-Lipschitz.

When $f$ is 1-Lipschitz, $f^c$ is 1-Lipschitz too. This is true because for any given $x$,

is 1-Lipschitz and therefore the infimum of the r.h.s. $f^c(y) = \inf_{x} |x-y| - f(x)$ is 1-Lipschitz.

Since $f^c$ is 1-Lipschitz, for all $x$ and $y$ we have

Since Eq. ($\ref{ineq:lip}$) is true for all $x$ and $y$,

where the right inequality follows by choosing $y = x$ in the infimum. We know that $f^{cc} = f$. This means that $-f^c(x)$ must be equal to $f(x)$ for Eq. ($\ref{ineq:sandwich}$) to hold.

Substituting $f^c = -f$ in Eq. ($\ref{eq:w1dualffc}$), we get

which is the dual form of 1-Wasserstein distance.

References

[1] Cédric Villani: Topics in Optimal Transportation, Chapter 1.
[2] Vincent Herrmann : Wasserstein GAN and the Kantorovich-Rubinstein Duality (Blog).
[3] Marco Cuturi: A Primer on Optimal Transport (Talk).

Written on April 8, 2020