• Corpus ID: 8649027

Learning with a Wasserstein Loss

Charlie Frogner, Chiyuan Zhang, Hossein Mobahi, Mauricio Araya-Polo, Tomaso A. Poggio

Learning to predict multi-label outputs is challenging, but in many problems there is a natural metric on the outputs that can be used to improve predictions. In this paper we develop a loss function for multi-label learning, based on the Wasserstein distance. The Wasserstein distance provides a natural notion of dissimilarity for probability measures. Although optimizing with respect to the exact Wasserstein distance is costly, recent work has described a regularized approximation that is… 

The Wasserstein Loss Function

This project explores the properties of the Wasserstein loss function by comparing its accuracy, convergence rate, and related measures against those of other loss functions, and by evaluating how changes in parameters and in the distance metric affect its performance.

The Cramer Distance as a Solution to Biased Wasserstein Gradients

This paper describes three natural properties of probability divergences that it argues reflect requirements from machine learning: sum invariance, scale sensitivity, and unbiased sample gradients. It proposes an alternative to the Wasserstein metric, the Cramer distance, which possesses all three properties.

Wasserstein of Wasserstein Loss for Learning Generative Models

The Wasserstein distance serves as a loss function for unsupervised learning, and it depends on the choice of a ground metric on sample space. The new formulation is more robust to the natural variability of images and provides for a more continuous discriminator in sample space.

A Simulated Annealing Based Inexact Oracle for Wasserstein Loss Minimization

A stochastic approach based on simulated annealing is introduced for solving Wasserstein loss minimization (WLM) problems, and a Gibbs sampler is developed to approximate the partial gradients of a sequence of Wasserstein losses effectively and efficiently.

Wasserstein Distance Measure Machines

A distance-based discriminative framework for learning with probability distributions is presented and it is proved that, for some learning problems, Wasserstein distance achieves low-error linear decision functions with high probability.

Heterogeneous Wasserstein Discrepancy for Incomparable Distributions

A novel extension of the Wasserstein distance is proposed for comparing two incomparable distributions. It hinges on the ideas of distributional slicing and embeddings, and on computing the closed-form Wasserstein distance between the sliced distributions.

The Fisher-Rao Loss for Learning under Label Noise

It is argued that the Fisher-Rao loss provides a natural trade-off between robustness and training dynamics, and numerical experiments with synthetic and MNIST datasets illustrate this performance.

Quantifying the Empirical Wasserstein Distance to a Set of Measures: Beating the Curse of Dimensionality

The formulation provides insights that help clarify why the Wasserstein distance enjoys favorable empirical performance across a wide range of statistical applications and establishes a strong duality result that generalizes the celebrated Kantorovich-Rubinstein duality.

Wasserstein Training of Restricted Boltzmann Machines

This work proposes a novel approach for Boltzmann machine training which assumes that a meaningful metric between observations is known, and derives a gradient of that distance with respect to the model parameters from the Kullback-Leibler divergence.

The Unbalanced Gromov Wasserstein Distance: Conic Formulation and Relaxation

Two Unbalanced Gromov-Wasserstein formulations are introduced: a distance and a more computationally tractable upper-bounding relaxation that allow the comparison of metric spaces equipped with arbitrary positive measures up to isometries.



A Smoothed Dual Approach for Variational Wasserstein Problems

The dual formulation of Wasserstein variational problems introduced recently can be regularized using an entropic smoothing, which leads to smooth, differentiable, convex optimization problems that are simpler to implement and numerically more stable.

Wasserstein Propagation for Semi-Supervised Learning

This paper introduces a technique for graph-based semi-supervised learning of histograms, derived from the theory of optimal transportation, which can be used for histograms on non-standard domains like circles and extends to related problems such as smoothing distributions on graph nodes.

Kernels for Vector-Valued Functions: a Review

This monograph reviews different methods to design or learn valid kernel functions for multiple outputs, paying particular attention to the connection between probabilistic and functional methods.

Fast Computation of Wasserstein Barycenters

This work proposes smoothing the Wasserstein distance with an entropic regularizer, which yields a strictly convex objective whose gradients can be computed at considerably lower computational cost using matrix scaling algorithms.

Sinkhorn Distances: Lightspeed Computation of Optimal Transport

This work smooths the classic optimal transport problem with an entropic regularization term, and shows that the resulting optimum is also a distance which can be computed through Sinkhorn's matrix scaling algorithm at a speed that is several orders of magnitude faster than that of transport solvers.
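
The matrix scaling idea summarized above can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: the function name `sinkhorn`, the parameters `reg` and `n_iters`, and the fixed iteration count are assumptions made for this example. It alternately rescales the rows and columns of the kernel exp(-cost / reg) until the transport plan matches both marginals.

```python
import math

def sinkhorn(a, b, cost, reg=0.1, n_iters=500):
    """Entropy-regularized optimal transport between histograms a and b.

    Hypothetical helper for illustration. Assumes a and b are strictly
    positive and each sums to 1. Returns (plan, transport_cost), where
    plan[i][j] is the mass moved from bin i of a to bin j of b.
    """
    n, m = len(a), len(b)
    # Gibbs kernel K = exp(-cost / reg); smaller reg is closer to exact OT.
    K = [[math.exp(-cost[i][j] / reg) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Rescale rows to match a, then columns to match b.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    plan = [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]
    total = sum(plan[i][j] * cost[i][j] for i in range(n) for j in range(m))
    return plan, total

# Usage: two histograms over two bins with a unit cost between the bins.
plan, total = sinkhorn([0.7, 0.3], [0.4, 0.6], [[0.0, 1.0], [1.0, 0.0]], reg=0.5)
```

A fixed iteration count keeps the sketch short; a practical version would more likely stop when the row marginals are within a tolerance and work in the log domain to avoid underflow for small `reg`.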

Optimal Decisions from Probabilistic Models: The Intersection-over-Union Case

S. Nowozin. 2014 IEEE Conference on Computer Vision and Pattern Recognition, 2014.

This work considers the popular intersection-over-union (IoU) score used in image segmentation benchmarks and shows that it results in a hard combinatorial decision problem, and proposes a statistical approximation to the objective function, as well as an approximate algorithm based on parametric linear programming.

The Earth Mover's Distance as a Metric for Image Retrieval

This paper investigates the properties of a metric between two distributions, the Earth Mover's Distance (EMD), for content-based image retrieval, and compares the retrieval performance of the EMD with that of other distances.
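
For intuition about the EMD, the one-dimensional case has a simple closed form: when two histograms of equal total mass share the same ordered, unit-spaced bins, the distance equals the sum of absolute differences between their cumulative sums. The sketch below covers only that special case; the function name `emd_1d` is an assumption made for this example, and the general EMD studied in the paper requires solving a transportation problem.

```python
from itertools import accumulate

def emd_1d(p, q):
    """Earth Mover's Distance between two 1-D histograms.

    Hypothetical helper for illustration. Assumes p and q have the same
    total mass and share the same ordered bins with unit spacing.
    """
    if len(p) != len(q):
        raise ValueError("histograms must have the same number of bins")
    # Mass that must cross each bin boundary is the gap between the CDFs.
    return sum(abs(x - y) for x, y in zip(accumulate(p), accumulate(q)))

# Usage: moving all mass from the first bin to the third costs 2 bin widths.
distance = emd_1d([1.0, 0.0, 0.0], [0.0, 0.0, 1.0])
```

Unlike a bin-by-bin comparison, this value grows with how far the mass has to travel, which is the property that makes the EMD attractive for retrieval.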

Fully convolutional networks for semantic segmentation

The key insight is to build “fully convolutional” networks that take input of arbitrary size and produce correspondingly-sized output with efficient inference and learning.

Unbalanced Optimal Transport: Geometry and Kantorovich Formulation

This article presents a new class of "optimal transportation"-like distances between arbitrary positive Radon measures. These distances are defined by two equivalent alternative formulations: (i) a

Comparing Clusterings in Space

A new measure for comparing clusterings is formulated that combines spatial and partitional information into a single measure using optimization theory, which eliminates pathological conditions in previous approaches.