# A Pseudo-Likelihood Approach to Linear Regression With Partially Shuffled Data

@article{Slawski2019APA,
title={A Pseudo-Likelihood Approach to Linear Regression With Partially Shuffled Data},
author={Martin Slawski and Guoqing Diao and Emanuel Ben-David},
journal={Journal of Computational and Graphical Statistics},
year={2019},
volume={30},
pages={991--1003}
}
• Published 3 October 2019
• Computer Science, Mathematics
• Journal of Computational and Graphical Statistics
Abstract Recently, there has been significant interest in linear regression in the situation where predictors and responses are not observed in matching pairs corresponding to the same statistical unit as a consequence of separate data collection and uncertainty in data integration. Mismatched pairs can considerably impact the model fit and disrupt the estimation of regression parameters. In this article, we present a method to adjust for such mismatches under “partial shuffling” in which a…
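As a rough illustration of the abstract's claim that mismatched pairs disrupt estimation, the sketch below simulates a linear model, shuffles the responses within a random subset of observations, and compares ordinary least squares on the matched and on the partially shuffled data. It is not the pseudo-likelihood method of the article. The function names, the shuffled fraction `frac_shuffled`, and all other settings are hypothetical choices made for this example.

```python
import numpy as np


def simulate_partially_shuffled(n=2000, d=3, frac_shuffled=0.3,
                                noise_sd=0.1, seed=0):
    """Simulate y = X @ beta + noise, then permute y within a random subset.

    All parameter values are illustrative assumptions, not taken from the article.
    """
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d))
    beta = np.arange(1, d + 1, dtype=float)
    y = X @ beta + noise_sd * rng.standard_normal(n)

    # Pick the observations whose responses get mismatched.
    k = int(frac_shuffled * n)
    idx = rng.choice(n, size=k, replace=False)

    # Permute the responses among the selected observations only.
    y_obs = y.copy()
    y_obs[idx] = y[rng.permutation(idx)]
    return X, y, y_obs, beta


def ols(X, y):
    """Ordinary least squares without an intercept."""
    return np.linalg.lstsq(X, y, rcond=None)[0]


X, y, y_obs, beta = simulate_partially_shuffled()
beta_matched = ols(X, y)       # fit on correctly matched pairs
beta_shuffled = ols(X, y_obs)  # naive fit that ignores the mismatches

print("true beta:        ", beta)
print("OLS, matched:     ", beta_matched)
print("OLS, part. shuffl:", beta_shuffled)
```

In this setup the naive fit on shuffled data is typically shrunk toward zero, because the mismatched responses carry essentially no information about their assigned predictors.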
15 Citations
• Computer Science, Mathematics
ArXiv
• 2021
A flexible exponential family prior on the permutation group is proposed; it can integrate various structures such as sparse and locally constrained shuffling, and compares favorably to competing methods.
• Computer Science, Mathematics
Mathematical Programming
• 2022
This paper proposes and studies a simple greedy local search algorithm for an important variant of linear regression in which the predictor-response pairs are partially mismatched, and proves an upper bound for the estimation error of the parameter.
• Computer Science
• 2020
A method based on observation-specific offsets to account for potential mismatches and $\ell_1$-penalization is proposed, and its statistical properties are discussed.
• Computer Science, Mathematics
• 2022
This paper presents a relatively complete analysis of the label permutation problem for the generalized linear model with multivariate responses, and proposes two methods, a “maximum likelihood estimation” algorithm and a “two-step estimation” algorithm, to accommodate different settings.
• Mathematics, Computer Science
• 2022
An estimator of the regression vector based on deconvolution is introduced, its consistency and asymptotic normality are demonstrated under an identifiability assumption, and a method for semi-supervised learning is devised.
• Computer Science
• 2021
An account of developments in methodology for dealing with linkage errors in regression analysis with linked datasets, with an emphasis on recent approaches and their connection to the so‐called “Broken Sample” problem is given.
• Computer Science
IEEE Signal Processing Letters
• 2020
The resulting algorithm outperforms state-of-the-art methods for fully shuffled data and remains tractable for up to 8-dimensional signals, an untouched regime in prior work.
• Mathematics
• 2022
We advance both the theory and practice of robust $\ell_p$-quasinorm regression for $p \in (0, 1]$ by using novel variants of iteratively reweighted least-squares (IRLS) to solve the underlying
• Computer Science
• 2022
This work proposes a novel robust two-step approach for the reconstruction of shuffled sparse signals and shows that under the assumption that the signals of interest admit a sparse representation over an overcomplete dictionary, unique signal recovery is possible.
• Mathematics
NeurIPS
• 2021
It is shown that a permutation-invariant system of polynomial equations has finitely many solutions, with each solution corresponding to a row permutation of the ground-truth data matrix.

## References

• Computer Science, Mathematics
J. Mach. Learn. Res.
• 2020
It is shown that the conditions for permutation recovery become considerably less stringent as the number of responses $m$ per observation increases, and the required signal-to-noise ratio no longer depends on the sample size $n$.
• Mathematics, Computer Science
Electronic Journal of Statistics
• 2019
This paper considers the common scenario of "sparsely permuted data" in which only a small fraction of the data is affected by a mismatch between response and predictors, and proposes an approach to treat permuted data as outliers, which motivates the use of robust regression formulations to estimate the regression parameter.
• Computer Science, Mathematics
UAI
• 2019
It turns out that in this situation, estimation of the regression parameter on the one hand and recovery of the underlying permutation on the other hand can be decoupled so that the computational hardness associated with the latter can be sidestepped.
• Computer Science
ArXiv
• 2018
This work proposes a framework that treats the unknown permutation as a latent variable and maximizes the likelihood of observations using a stochastic expectation-maximization (EM) approach, and shows on synthetic data that the stochastic EM algorithm developed has several advantages, including lower parameter error, less sensitivity to the choice of initialization, and significantly better performance on datasets that are only partially shuffled.
• Computer Science, Mathematics
• 2017
This work proposes several estimators that recover the weights of a noisy linear model from labels that are shuffled by an unknown permutation, and shows that the analog of the classical least-squares estimator produces inconsistent estimates in this setting.
• Computer Science
Journal of the American Statistical Association
• 2020
A three-step algorithm is proposed in which the parameters are initialized by solving an orthogonal Procrustes problem to estimate a translation matrix ignoring the mismatch, and a mapping matrix aiming to correct the mismatch is estimated using hard-thresholding to induce sparsity, while incorporating potential group information.
• Mathematics, Computer Science
NIPS
• 2017
This article considers algorithmic and statistical aspects of linear regression when the correspondence between the covariates and the responses is unknown. First, a fully polynomial-time
• Computer Science
Statistics in medicine
• 2015
This paper proposes a mixture model, fitted by pairwise pseudo‐likelihood, in which the indicator of whether records belong to the same individual is treated as missing, and applies the method to estimating the association between the pregnancy durations of first- and second-born children of the same mother, using a register without a mother identifier.
• Mathematics
• 2009
Normal mixture distributions are arguably the most important mixture models, and also the most technically challenging. The likelihood function of the normal mixture model is unbounded based on a set