## Background

*Optimal Transport (OT)* defines a set of geometric tools to compare probability measures supported on a metric space. These tools make use of techniques from optimization (linear programming in particular) and differential calculus. They have found multiple application in fields including economics, imaging and fluid mechanics to model transfers, morphings and displacements.

The OT geometry is well known under different names, e.g. *Wasserstein*, *Monge-Kantorovich* or *earth moverâ€™s* distances. The OT geometry has two properties that make it particularly well suited to solve machine learning problems, for which the ability to compare accurately high-dimensional empirical probability measures is often crucial:

- Because OT distances take into account the geometry of the underlying metric space, OT distances are particularly efficient in quantifying the distance between two measures even when their respective supports do not overlap.

- Unlike information divergences (KL, Jensen-Shannon, Hellinger etc.), which are only defined for absolutely continuous measures, the OT distance is well-defined for both discrete (e.g. empirical) and continuous measures, with little dependence on smoothness. In particular, OT distances do not require the painstaking choice of a smoothing kernel bandwidth to preprocess empirical measures.

Despite these strengths, practical applications of OT theory in machine learning have lagged behind theoretical achievements, obtained mostly over the last decade (e.g. Villani's Fields Medal in 2010). This situation can be essentially blamed on the high cost usually associated with the computation of optimal transport.