• Corpus ID: 231774377

Comparing the Value of Labeled and Unlabeled Data in Method-of-Moments Latent Variable Estimation

Mayee F. Chen, Benjamin Cohen-Wang, Stephen Mussmann, Frederic Sala, Christopher Ré
Labeling data for modern machine learning is expensive and time-consuming. Latent variable models can be used to infer labels from weaker, easier-to-acquire sources operating on unlabeled data. Such models can also be trained using labeled data, presenting a key question: should a user invest in few labeled or many unlabeled points? We answer this via a framework centered on model misspecification in method-of-moments latent variable estimation. Our core result is a bias-variance decomposition…

Data Consistency for Weakly Supervised Learning

A novel weak supervision algorithm that processes noisy labels, i.e., weak signals, while also considering features of the training data to produce accurate labels for training, which significantly outperforms state-of-the-art weak supervision methods on both text and image classification tasks.

Firebolt: Weak Supervision Under Weaker Assumptions

Firebolt is presented, a new weak supervision framework that seeks to operate under weaker assumptions, learns the class balance and class-specific accuracy of LFs jointly from unlabeled data, and carries out inference in an efficient and interpretable manner.

Generative Modeling Helps Weak Supervision (and Vice Versa)

This work proposes a model fusing programmatic weak supervision and generative adversarial networks and provides theoretical justification motivating this fusion, and is the first approach to enable data augmentation through weakly supervised synthetic images and pseudolabels.

End-to-End Weak Supervision

This work proposes an end-to-end approach for directly learning the downstream model by maximizing its agreement with probabilistic labels generated by reparameterizing prior probabilistic posteriors with a neural network.

Universalizing Weak Supervision

This work proposes a universal technique that enables weak supervision over any label type while still offering desirable properties, including practical flexibility, computational efficiency, and theoretical guarantees.

Shoring Up the Foundations: Fusing Model Embeddings and Weak Supervision

This work proposes Liger, a combination that uses foundation model embeddings to improve two crucial elements of existing weak supervision techniques, and produces finer estimates of weak source quality by partitioning the embedding space and learning per-part source accuracies.



Learning the Structure of Generative Models without Labeled Data

This work proposes a structure estimation method that maximizes the $\ell_1$-regularized marginal pseudolikelihood of the observed data and shows that the amount of unlabeled data required to identify the true structure scales sublinearly in the number of possible dependencies for a broad class of models.

Does Unlabeled Data Provably Help? Worst-case Analysis of the Sample Complexity of Semi-Supervised Learning

It is proved that for basic hypothesis classes over the real line, if the distribution of unlabeled data is ‘smooth’, knowledge of that distribution cannot improve the labeled sample complexity by more than a constant factor.

Unlabeled data: Now it helps, now it doesn't

A finite sample analysis is developed that characterizes the value of unlabeled data and quantifies the performance improvement of SSL compared to supervised learning, and shows that there are large classes of problems for which SSL can significantly outperform supervised learning in finite sample regimes and sometimes also in terms of error convergence rates.

Learning Dependency Structures for Weak Supervision Models

It is shown that the amount of unlabeled data needed can scale sublinearly or even logarithmically with the number of sources, improving over previous efforts that ignore the sparsity pattern in the dependency structure and scale linearly in $m$.

The Effect of Model Misspecification on Semi-Supervised Classification

  • Ting Yang, C. Priebe
  • Computer Science
    IEEE Transactions on Pattern Analysis and Machine Intelligence
  • 2011
This work examines the effect of model misspecification on semi-supervised classification performance and sheds some light on when and why performance degradation occurs, considering maximum likelihood estimation in finite mixture models and the Bayes plug-in classifier.

Fast and Three-rious: Speeding Up Weak Supervision with Triplet Methods

FlyingSquid is built, a weak supervision framework that runs orders of magnitude faster than previous weak supervision approaches and requires fewer assumptions, and proves bounds on generalization error without assuming that the latent variable model can exactly parameterize the underlying data distribution.

Training Complex Models with Multi-Task Weak Supervision

This work shows that by solving a matrix completion-style problem, it can recover the accuracies of these multi-task sources given their dependency structure, but without any labeled data, leading to higher-quality supervision for training an end model.

Estimating Latent-Variable Graphical Models using Moments and Likelihoods

This work shows that using the method of moments in conjunction with composite likelihood yields consistent parameter estimates for a much broader class of discrete directed and undirected graphical models, including loopy graphs with high treewidth.

Data Programming: Creating Large Training Sets, Quickly

A paradigm for the programmatic creation of training sets called data programming is proposed in which users express weak supervision strategies or domain heuristics as labeling functions, which are programs that label subsets of the data, but that are noisy and may conflict.

Introduction to Semi-Supervised Learning

This introductory book presents some popular semi-supervised learning models, including self-training, mixture models, co-training and multiview learning, graph-based methods, and semi-supervised support vector machines, and discusses their basic mathematical formulation.