Corpus ID: 209386692

Pairwise Feedback for Data Programming

Benedikt Boecking and Artur W. Dubrawski
The scalability of the labeling process and the attainable quality of labels have become limiting factors for many applications of machine learning. The programmatic creation of labeled datasets via the synthesis of noisy heuristics provides a promising avenue to address this problem. We propose to improve modeling of latent class variables in the programmatic creation of labeled datasets by incorporating pairwise feedback into the process. We discuss the ease with which such pairwise feedback… 
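As a rough illustration of the idea, and not the paper's actual model: pairwise "same class" feedback can be used to score noisy heuristics by how often their votes agree with that feedback, and the scores can then weight each heuristic's vote. All function names below are hypothetical, and the agreement-rate weighting is a deliberately simple stand-in for proper latent-variable modeling.

```python
# Hypothetical noisy heuristics over short strings (toy example).
def lf_short(x):          # heuristic: short strings -> class 0
    return 0 if len(x) < 5 else 1

def lf_has_digit(x):      # heuristic: contains a digit -> class 1
    return 1 if any(c.isdigit() for c in x) else 0

LFS = [lf_short, lf_has_digit]

def heuristic_weights(pairs, same_class):
    """Weight each heuristic by its agreement with pairwise feedback.

    pairs: list of (item_a, item_b) tuples.
    same_class: parallel list of bools, True when an annotator judged
    the pair to belong to the same class.
    """
    weights = []
    for lf in LFS:
        # A heuristic "agrees" with a pair when giving both items the
        # same label matches the annotator's same-class judgment.
        agree = sum((lf(a) == lf(b)) == s
                    for (a, b), s in zip(pairs, same_class))
        weights.append(agree / max(len(pairs), 1))
    return weights

def weighted_vote(x, weights):
    """Combine heuristic votes for item x, weighted by feedback agreement."""
    score = sum(w if lf(x) == 1 else -w for lf, w in zip(LFS, weights))
    return 1 if score > 0 else 0
```

For instance, a "different class" judgment on a pair that one heuristic labels identically lowers only that heuristic's weight, so subsequent votes lean toward the heuristics consistent with the feedback.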


Interactive Weak Supervision: Learning Useful Heuristics for Data Labeling

This work develops the first framework for interactive weak supervision, in which a method proposes heuristics and learns from user feedback on each proposal; it demonstrates that only a small number of feedback iterations are needed to train models that achieve highly competitive test-set performance without access to ground-truth training labels.

Train and You'll Miss It: Interactive Model Iteration with Weak Supervision and Pre-Trained Embeddings

Borrowing from weak supervision, in which models are trained with noisy sources of signal instead of hand-labeled data, this work outperforms WS without extension, TL without fine-tuning, and state-of-the-art weakly supervised deep networks, all while training in less than half a second.

Shoring Up the Foundations: Fusing Model Embeddings and Weak Supervision

This work proposes Liger, a combination that uses foundation model embeddings to improve two crucial elements of existing weak supervision techniques, producing finer estimates of weak source quality by partitioning the embedding space and learning per-part source accuracies.

The Word is Mightier than the Label: Learning without Pointillistic Labels using Data Programming

This paper surveys recent work on weak supervision and, in particular, investigates the Data Programming (DP) framework, which takes a set of potentially noisy heuristics as input; it analyzes the mathematical foundations of DP and demonstrates its power through application to two real-world text classification tasks.

Classifying Unstructured Clinical Notes via Automatic Weak Supervision

This work introduces a general weakly-supervised text classification framework that learns from class-label descriptions only, without the need to use any human-labeled documents, and leverages the linguistic domain knowledge stored within pre-trained language models and the data programming framework to assign code labels to individual texts.



Learning the Structure of Generative Models without Labeled Data

This work proposes a structure estimation method that maximizes the ℓ1-regularized marginal pseudolikelihood of the observed data and shows that the amount of unlabeled data required to identify the true structure scales sublinearly in the number of possible dependencies for a broad class of models.

Training Complex Models with Multi-Task Weak Supervision

This work shows that by solving a matrix completion-style problem, it can recover the accuracies of these multi-task sources given their dependency structure, but without any labeled data, leading to higher-quality supervision for training an end model.

Learning with Feature Feedback: from Theory to Practice

This paper formalizes two models of feature feedback, gives learning algorithms for them, quantifies their usefulness in the learning process, and shows the efficacy of these methods.

Data Programming: Creating Large Training Sets, Quickly

A paradigm for the programmatic creation of training sets called data programming is proposed in which users express weak supervision strategies or domain heuristics as labeling functions, which are programs that label subsets of the data, but that are noisy and may conflict.
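The labeling-function idea can be sketched in a few lines. This is an illustrative sketch, not the paper's API: each function labels the subset of the data it covers and abstains elsewhere, and conflicts are resolved here by a simple majority vote, whereas data programming instead learns a generative model over the functions' unknown accuracies.

```python
from collections import Counter

ABSTAIN = None  # a labeling function returns None when it does not apply

# Hypothetical labeling functions for a toy spam (1) vs. ham (0) task.
def lf_keyword_spam(text):
    return 1 if "free money" in text.lower() else ABSTAIN

def lf_keyword_ham(text):
    return 0 if "meeting" in text.lower() else ABSTAIN

def lf_many_exclaims(text):
    return 1 if text.count("!") >= 3 else ABSTAIN

def majority_label(text, lfs):
    """Resolve noisy, possibly conflicting votes by majority; abstain if none fire."""
    votes = [v for lf in lfs if (v := lf(text)) is not ABSTAIN]
    if not votes:
        return ABSTAIN
    return Counter(votes).most_common(1)[0][0]
```

For example, `majority_label("FREE MONEY now!!!", [lf_keyword_spam, lf_keyword_ham, lf_many_exclaims])` collects two spam votes and no ham votes, while a text that triggers no function is left unlabeled.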

Clustering-Driven Deep Embedding With Pairwise Constraints

This paper proposes a new framework, Clustering-driven deep embedding with PAirwise Constraints (CPAC), for nonparametric clustering with a neural network based on a Siamese architecture, and shows that clustering performance increases under this scheme even with a limited number of user queries.

Nonparametric Regression with Comparisons: Escaping the Curse of Dimensionality with Ordinal Information

This work develops an algorithm called Ranking-Regression (RR), analyzes its accuracy as a function of the sizes of the labeled and unlabeled datasets and of various noise parameters, and presents lower bounds that establish fundamental limits for the task and show that RR is optimal in a variety of settings.

Learning Dependency Structures for Weak Supervision Models

It is shown that the amount of unlabeled data needed can scale sublinearly or even logarithmically with the number of sources m, improving over previous efforts that ignore the sparsity pattern in the dependency structure and scale linearly in m.

Learning from discriminative feature feedback

An efficient online algorithm is presented for learning a multi-class classifier from labels as well as simple explanations, which can be provided whenever the target concept is a decision tree or, more generally, belongs to a particular subclass of DNF formulas.

Closing the Loop: Fast, Interactive Semi-Supervised Annotation With Queries on Features and Instances

This work presents a novel semi-supervised training algorithm developed for this setting that is fast enough to support real-time interactive speeds and at least as accurate as preexisting methods for learning with mixed feature and instance labels.

Multi-Resolution Weak Supervision for Sequential Data

This work proposes Dugong, the first framework to model multi-resolution weak supervision sources with complex correlations and assign probabilistic labels to training data; it proves that, under mild conditions, Dugong can uniquely recover the unobserved accuracy and correlation parameters, and it uses parameter sharing to improve sample complexity.