Corpus ID: 240419893

Regularization for Shuffled Data Problems via Exponential Family Priors on the Permutation Group

Zhenbang Wang, Emanuel Ben-David, Martin Slawski
In the analysis of data sets consisting of (X, Y)-pairs, a tacit assumption is that each pair corresponds to the same observation unit. If, however, such pairs are obtained via record linkage of two files, this assumption can be violated as a result of mismatch error rooted, for example, in the lack of reliable identifiers in the two files. Recently, there has been a surge of interest in this setting under the term “shuffled data”, in which the underlying correct pairing of (X, Y)-pairs is…

A Pseudo-Likelihood Approach to Linear Regression With Partially Shuffled Data
A method to adjust for such mismatches under “partial shuffling” in which a sufficiently large fraction of (predictors, response)-pairs are observed in their correct correspondence is presented, based on a pseudo-likelihood in which each term takes the form of a two-component mixture density.
Stochastic EM for Shuffled Linear Regression
This work proposes a framework that treats the unknown permutation as a latent variable and maximizes the likelihood of the observations using a stochastic expectation-maximization (EM) approach. It shows on synthetic data that the resulting Stochastic EM algorithm has several advantages, including lower parameter error, less sensitivity to the choice of initialization, and significantly better performance on datasets that are only partially shuffled.
Linear Regression with Shuffled Labels
This work proposes several estimators that recover the weights of a noisy linear model from labels that are shuffled by an unknown permutation, and shows that the analog of the classical least-squares estimator produces inconsistent estimates in this setting.
Linear regression with sparsely permuted data
This paper considers the common scenario of "sparsely permuted data", in which only a small fraction of the data is affected by a mismatch between response and predictors, and proposes to treat permuted data as outliers, which motivates the use of robust regression formulations to estimate the regression parameter.
A Sparse Representation-Based Approach to Linear Regression with Partially Shuffled Labels
It turns out that in this situation, estimation of the regression parameter on the one hand and recovery of the underlying permutation on the other hand can be decoupled so that the computational hardness associated with the latter can be sidestepped.
A Two-Stage Approach to Multivariate Linear Regression with Sparsely Mismatched Data
It is shown that the conditions for permutation recovery become considerably less stringent as the number of responses $m$ per observation increases, and the required signal-to-noise ratio no longer depends on the sample size $n$.
Estimation in exponential family regression based on linked data contaminated by mismatch error
A method based on observation-specific offsets to account for potential mismatches and $\ell_1$-penalization is proposed, and its statistical properties are discussed.
An Algebraic-Geometric Approach to Shuffled Linear Regression
Using the machinery of algebraic geometry it is proved that as long as the independent samples are generic, this polynomial system is always consistent with at most $n!$ complex roots, regardless of any type of corruption inflicted on the observations.
Fourier Theoretic Probabilistic Inference over Permutations
This paper uses the "low-frequency" terms of a Fourier decomposition to represent distributions over permutations compactly, and presents Kronecker conditioning, a novel approach for maintaining and updating these distributions directly in the Fourier domain, allowing for polynomial time bandlimited approximations.
Optimal Estimator for Unlabeled Linear Regression
This paper proposes a one-step estimator which is optimal from both the computational and the statistical aspects of unlabeled linear regression and exhibits the same order of computational complexity as that of the oracle case.
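
Several of the entries above concern linear regression in which part of the response has been paired with the wrong predictors. The following is a minimal sketch, not taken from any of the listed papers, of why ignoring such mismatches is harmful: on synthetic one-dimensional data, an ordinary least-squares slope computed from partially shuffled pairs is pulled toward zero. The sample size, true slope, noise level, shuffled fraction, and the helper names `ols_slope` and `shuffle_fraction` are all assumptions chosen for illustration.

```python
import random


def ols_slope(x, y):
    """Ordinary least-squares slope of y on x (with intercept)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


def shuffle_fraction(y, fraction, rng):
    """Return a copy of y in which a random subset of entries is permuted."""
    y = list(y)
    k = int(fraction * len(y))
    idx = rng.sample(range(len(y)), k)   # positions affected by mismatch
    vals = [y[i] for i in idx]
    rng.shuffle(vals)                    # permute responses among those positions
    for i, v in zip(idx, vals):
        y[i] = v
    return y


# Hypothetical synthetic setup: y = beta * x + small Gaussian noise.
rng = random.Random(0)
n, beta = 5000, 2.0
x = [rng.gauss(0, 1) for _ in range(n)]
y = [beta * xi + rng.gauss(0, 0.1) for xi in x]

# Mismatch half of the responses, then fit as if nothing had happened.
y_shuffled = shuffle_fraction(y, 0.5, rng)

slope_clean = ols_slope(x, y)
slope_shuffled = ols_slope(x, y_shuffled)
print(f"slope on correctly paired data:   {slope_clean:.3f}")
print(f"slope on partially shuffled data: {slope_shuffled:.3f}")
```

In this setup the mismatched responses are essentially unrelated to their predictors, so the naive slope shrinks roughly in proportion to the fraction of correctly paired observations. The methods summarized above aim to correct for this kind of bias; the sketch only illustrates the problem and implements none of them.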