People mover's distance: Class level geometry using fast pairwise data adaptive transportation costs

@article{Cloninger2019PeopleMD,
  title={People mover's distance: Class level geometry using fast pairwise data adaptive transportation costs},
  author={Alexander Cloninger and Brita Roy and Carley Riley and Harlan M. Krumholz},
  journal={Applied and Computational Harmonic Analysis},
  year={2019}
}
We address the problem of defining a network graph on a large collection of classes. Each class comprises a collection of data points, sampled in a non-i.i.d. way, from some unknown underlying distribution. The application we consider in this paper is a large-scale, high-dimensional survey of people living in the US, and the question of how similar or different the various counties in which these people live are. We use a co-clustering diffusion metric to learn the underlying distribution…

Linear Optimal Transport Embedding: Provable fast Wasserstein distance computation and classification for nonlinear problems
This paper characterizes a number of settings in which LOT embeds families of distributions into a space in which they are linearly separable, and proves conditions under which the distance of the LOT embedding between two distributions in arbitrary dimension is nearly isometric to the Wasserstein-2 distance between those distributions.
A low discrepancy sequence on graphs
This work describes a construction of a sampling scheme analogous to the so-called Leja points in complex potential theory that can be proved to give low discrepancy estimates for the approximation of the expected value by the empirical expected value based on these points.

References

On the Surprising Behavior of Distance Metrics in High Dimensional Spaces
This paper examines the behavior of the commonly used L_k norm and shows that the problem of meaningfulness in high dimensionality is sensitive to the value of k, which means that the Manhattan distance metric is consistently preferable to the Euclidean distance metric for high-dimensional data mining applications.
The Earth Mover's Distance as a Metric for Image Retrieval
This paper investigates the properties of a metric between two distributions, the Earth Mover's Distance (EMD), for content-based image retrieval, and compares the retrieval performance of the EMD with that of other distances.
Earth Mover’s Distance and Equivalent Metrics for Spaces with Hierarchical Partition Trees
…partition tree, and prove their equivalence. Similar metrics have previously been defined in more restrictive settings; in particular, the well-known Earth Mover’s Distance is widely used in machine…
Approximate earth mover’s distance in linear time
It is experimentally shown that wavelet EMD is a good approximation to EMD and has similar performance, but requires much less computation, while the comparison is about as fast as for normal Euclidean distance or the chi-squared statistic.
Diffusion maps
In this paper, we provide a framework based upon diffusion processes for finding meaningful geometric descriptions of data sets. We show that eigenfunctions of Markov matrices can be used to…
Understanding bag-of-words model: a statistical framework
A statistical framework which generalizes the bag-of-words representation, in which the visual words are generated by a statistical process rather than using a clustering algorithm, while the empirical performance is competitive with clustering-based methods.
Pattern Classification
Classification • Supervised – parallelepiped – minimum distance – maximum likelihood (Bayes Rule) > non-parametric > parametric – support vector machines – neural networks – context classification •…
Hölder–Lipschitz Norms and Their Duals on Spaces with Semigroups, with Applications to Earth Mover’s Distance
We introduce a family of bounded, multiscale distances on any space equipped with an operator semigroup. In many examples, these distances are equivalent to a snowflake of the natural distance on the…
Sampling, denoising and compression of matrices by coherent matrix organization
The need to organize and analyze real-valued matrices arises in various settings, notably in data analysis (where matrices are multivariate data sets) and in numerical analysis (where…
Topic modeling: beyond bag-of-words
A hierarchical generative probabilistic model that incorporates both n-gram statistics and latent topic variables by extending a unigram topic model to include properties of a hierarchical Dirichlet bigram language model is explored.