A Pseudo-Likelihood Approach to Linear Regression With Partially Shuffled Data

Martin Slawski, Guoqing Diao, and Emanuel Ben-David. Journal of Computational and Graphical Statistics, pages 991-1003.
Abstract: Recently, there has been significant interest in linear regression in the situation where predictors and responses are not observed in matching pairs corresponding to the same statistical unit, as a consequence of separate data collection and uncertainty in data integration. Mismatched pairs can considerably impact the model fit and disrupt the estimation of regression parameters. In this article, we present a method to adjust for such mismatches under “partial shuffling” in which a…
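
The effect of mismatches on a naive fit can be illustrated with a small simulation. The following is a minimal sketch, not the method of the article: the helper `partially_shuffle`, the sample size, the coefficient vector, and the 30% shuffling fraction are all hypothetical choices made for illustration. It permutes a random subset of responses among themselves and compares ordinary least squares on matched and on mismatched data.

```python
import numpy as np

def partially_shuffle(y, fraction, rng):
    """Permute a random subset (about `fraction` of the entries) of y among themselves."""
    n = len(y)
    k = int(round(fraction * n))
    idx = rng.choice(n, size=k, replace=False)   # positions affected by mismatch
    y_shuffled = y.copy()
    y_shuffled[idx] = y[rng.permutation(idx)]    # reassign responses within the subset
    return y_shuffled

# Hypothetical simulation settings (assumptions, not taken from the article).
rng = np.random.default_rng(0)
n, d = 2000, 3
X = rng.normal(size=(n, d))
beta = np.array([1.0, -2.0, 0.5])
y = X @ beta + 0.1 * rng.normal(size=n)
y_mis = partially_shuffle(y, fraction=0.3, rng=rng)

# Ordinary least squares on correctly matched and on partially shuffled pairs.
beta_clean = np.linalg.lstsq(X, y, rcond=None)[0]
beta_naive = np.linalg.lstsq(X, y_mis, rcond=None)[0]
err_clean = np.linalg.norm(beta_clean - beta)
err_naive = np.linalg.norm(beta_naive - beta)
print(f"error with matched pairs:    {err_clean:.3f}")
print(f"error with 30% mismatches:   {err_naive:.3f}")
```

In this setup the naive estimate is shrunk toward zero, because a shuffled response is essentially unrelated to the predictor it is paired with.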

Regularization for Shuffled Data Problems via Exponential Family Priors on the Permutation Group

This paper proposes a flexible exponential family prior on the permutation group that can integrate various structures, such as sparse and locally constrained shuffling, and that compares favorably to competing methods.

Linear regression with partially mismatched data: local search with theoretical guarantees

This paper proposes and studies a simple greedy local search algorithm for an important variant of linear regression in which the predictor-response pairs are partially mismatched, and proves an upper bound for the estimation error of the parameter.

Estimation in exponential family regression based on linked data contaminated by mismatch error

A method based on observation-specific offsets to account for potential mismatches and $\ell_1$-penalization is proposed, and its statistical properties are discussed.
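
For intuition, the idea of observation-specific offsets with an $\ell_1$ penalty can be sketched in the simplest special case, the Gaussian linear model. This is an assumption made for illustration and not the paper's general exponential family procedure; the function names, the alternating minimization scheme, and the value of `lam` are hypothetical. The sketch minimizes $\frac{1}{2}\|y - X\beta - \xi\|_2^2 + \lambda \|\xi\|_1$ over $\beta$ and the offsets $\xi$.

```python
import numpy as np

def soft_threshold(z, lam):
    """Proximal operator of lam * |.|, applied elementwise."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def offset_regression(X, y, lam, n_iter=200):
    """Alternating minimization of 0.5*||y - X b - xi||^2 + lam*||xi||_1 (illustrative sketch)."""
    xi = np.zeros(X.shape[0])
    for _ in range(n_iter):
        # Least-squares update of the coefficients with offsets held fixed.
        beta = np.linalg.lstsq(X, y - xi, rcond=None)[0]
        # Closed-form update of the offsets with coefficients held fixed.
        xi = soft_threshold(y - X @ beta, lam)
    return beta, xi

# Hypothetical data: 10% of the responses are permuted among themselves.
rng = np.random.default_rng(1)
n, d = 500, 3
X = rng.normal(size=(n, d))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + 0.1 * rng.normal(size=n)
idx = rng.choice(n, size=n // 10, replace=False)
y_mis = y.copy()
y_mis[idx] = y[rng.permutation(idx)]

beta_ols = np.linalg.lstsq(X, y_mis, rcond=None)[0]
beta_off, xi = offset_regression(X, y_mis, lam=0.3)
print("OLS error:   ", np.linalg.norm(beta_ols - beta_true))
print("offset error:", np.linalg.norm(beta_off - beta_true))
```

Nonzero entries of `xi` flag observations that the fit treats as likely mismatches. In this linear special case the criterion coincides with Huber-type robust regression with threshold `lam`.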

Regression with Label Permutation in Generalized Linear Model

This paper presents a relatively complete analysis of the label permutation problem for the generalized linear model with multivariate responses, and proposes two methods, a “maximum likelihood estimation” algorithm and a “two-step estimation” algorithm, to accommodate different settings.

Linear regression with unmatched data: a deconvolution perspective

An estimator of the regression vector based on deconvolution is introduced, its consistency and asymptotic normality are demonstrated under an identifiability assumption, and a method for semi-supervised learning is devised.

Regression with linked datasets subject to linkage error

An account of developments in methodology for dealing with linkage errors in regression analysis with linked datasets, with an emphasis on recent approaches and their connection to the so‐called “Broken Sample” problem is given.

Linear Regression Without Correspondences via Concave Minimization

The resulting algorithm outperforms state-of-the-art methods for fully shuffled data and remains tractable for up to 8-dimensional signals, an untouched regime in prior work.

Global Linear and Local Superlinear Convergence of IRLS for Non-Smooth Robust Regression

We advance both the theory and practice of robust $\ell_p$-quasinorm regression for $p \in (0, 1]$ by using novel variants of iteratively reweighted least-squares (IRLS) to solve the underlying
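
As background, a textbook form of IRLS for $\ell_p$ regression can be sketched as follows. This is not one of the paper's novel variants: the function name `irls_lp`, the fixed smoothing constant `eps`, the iteration count, and the test data are assumptions made for illustration. Each iteration solves a weighted least-squares problem with weights $|r_i|^{p-2}$ computed from the current residuals.

```python
import numpy as np

def irls_lp(X, y, p=1.0, n_iter=50, eps=1e-6):
    """Basic IRLS sketch for min_b sum_i |y_i - x_i^T b|^p with 0 < p <= 1."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]      # start from ordinary least squares
    for _ in range(n_iter):
        r = y - X @ beta
        # Residuals are floored at eps so that the weights stay finite.
        w = np.maximum(np.abs(r), eps) ** (p - 2)
        Xw = X * w[:, None]                          # rows of X scaled by their weights
        beta = np.linalg.solve(X.T @ Xw, Xw.T @ y)   # weighted normal equations
    return beta

# Hypothetical data with 10% gross outliers in the response.
rng = np.random.default_rng(2)
n = 200
X = rng.normal(size=(n, 2))
beta_true = np.array([2.0, -1.0])
y = X @ beta_true + 0.05 * rng.normal(size=n)
y[:20] += 10.0

beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
beta_l1 = irls_lp(X, y, p=1.0)
print("OLS error: ", np.linalg.norm(beta_ols - beta_true))
print("IRLS error:", np.linalg.norm(beta_l1 - beta_true))
```

With `p=1.0` the iteration targets least absolute deviations, which downweights the outlying responses that dominate the least-squares fit.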

Reconstruction of Multivariate Sparse Signals from Mismatched Samples

This work proposes a novel robust two-step approach for the reconstruction of shuffled sparse signals and shows that under the assumption that the signals of interest admit a sparse representation over an overcomplete dictionary, unique signal recovery is possible.

Unlabeled Principal Component Analysis

It is shown that a permutation-invariant system of polynomial equations has finitely many solutions, with each solution corresponding to a row permutation of the ground-truth data matrix.



A Two-Stage Approach to Multivariate Linear Regression with Sparsely Mismatched Data

It is shown that the conditions for permutation recovery become considerably less stringent as the number of responses $m$ per observation increases, and the required signal-to-noise ratio no longer depends on the sample size $n$.

Linear regression with sparsely permuted data

This paper considers the common scenario of "sparsely permuted data", in which only a small fraction of the data is affected by a mismatch between response and predictors, and proposes an approach that treats permuted data as outliers, which motivates the use of robust regression formulations to estimate the regression parameter.

A Sparse Representation-Based Approach to Linear Regression with Partially Shuffled Labels

It turns out that in this situation, estimation of the regression parameter on the one hand and recovery of the underlying permutation on the other hand can be decoupled so that the computational hardness associated with the latter can be sidestepped.

Stochastic EM for Shuffled Linear Regression

This work proposes a framework that treats the unknown permutation as a latent variable and maximizes the likelihood of the observations using a stochastic expectation-maximization (EM) approach, and shows on synthetic data that the resulting stochastic EM algorithm has several advantages, including lower parameter error, less sensitivity to the choice of initialization, and significantly better performance on datasets that are only partially shuffled.

Linear Regression with Shuffled Labels

This work proposes several estimators that recover the weights of a noisy linear model from labels that are shuffled by an unknown permutation, and shows that the analog of the classical least-squares estimator produces inconsistent estimates in this setting.

Spherical Regression Under Mismatch Corruption With Application to Automated Knowledge Translation

A three-step algorithm is proposed in which the parameters are initialized by solving an orthogonal Procrustes problem to estimate a translation matrix while ignoring the mismatch, and a mapping matrix aiming to correct the mismatch is then estimated using hard-thresholding to induce sparsity, while incorporating potential group information.
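
The initialization step rests on the standard closed-form solution of the orthogonal Procrustes problem, sketched below. This covers only that generic step and not the paper's full algorithm; the function name, the matrices `A` and `B`, and the noiseless test setup are hypothetical. For $\min_R \|AR - B\|_F$ over orthogonal $R$, the minimizer is $R = UV^\top$, where $A^\top B = U \Sigma V^\top$ is a singular value decomposition.

```python
import numpy as np

def orthogonal_procrustes(A, B):
    """Return the orthogonal R minimizing ||A @ R - B||_F via the SVD of A^T B."""
    U, _, Vt = np.linalg.svd(A.T @ B)
    return U @ Vt

# Hypothetical check: recover a known orthogonal matrix from noiseless data.
rng = np.random.default_rng(3)
A = rng.normal(size=(50, 4))
R_true, _ = np.linalg.qr(rng.normal(size=(4, 4)))   # random orthogonal matrix
B = A @ R_true

R_hat = orthogonal_procrustes(A, B)
print("max abs deviation from R_true:", np.max(np.abs(R_hat - R_true)))
```

When some rows of `B` are mismatched, this estimate is only a starting point, which is why the summary describes subsequent correction steps.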

Linear regression without correspondence

This article considers algorithmic and statistical aspects of linear regression when the correspondence between the covariates and the responses is unknown. First, a fully polynomial-time

A mixture model for the analysis of data derived from record linkage

This paper proposes a mixture model in which the indicator of whether records belong to the same individual is treated as missing, uses a pairwise pseudo‐likelihood, and applies the method to estimating the association between the pregnancy durations of the first- and second-born children of the same mother from a register without a mother identifier.

Hypothesis test for normal mixture models: The EM approach

Normal mixture distributions are arguably the most important mixture models, and also the most technically challenging. The likelihood function of the normal mixture model is unbounded based on a set