Snorkel: rapid training data creation with weak supervision

@article{Ratner2017SnorkelRT,
  title={Snorkel: rapid training data creation with weak supervision},
  author={Alexander J. Ratner and Stephen H. Bach and Henry R. Ehrenberg and Jason Alan Fries and Sen Wu and Christopher R{\'e}},
  journal={The VLDB Journal},
  year={2020},
  volume={29},
  pages={709--730}
}
Labeling training data is increasingly the largest bottleneck in deploying machine learning systems. […] We present a flexible interface layer for writing labeling functions based on our experience over the past year collaborating with companies, agencies, and research labs. In a user study, subject matter experts build models 2.8× faster and increase predictive performance an average 45.5% versus seven hours of hand labeling. We study the modeling tradeoffs in this new setting and propose an…
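
A minimal sketch of what such labeling functions look like with the open-source snorkel package; the spam/ham task, the heuristics, and the toy data are hypothetical:

```python
# Toy spam/ham task; heuristics and data are hypothetical.
import pandas as pd
from snorkel.labeling import labeling_function, PandasLFApplier
from snorkel.labeling.model import LabelModel

ABSTAIN, HAM, SPAM = -1, 0, 1

@labeling_function()
def lf_contains_link(x):
    # Weak heuristic: messages with URLs are often spam.
    return SPAM if "http" in x.text.lower() else ABSTAIN

@labeling_function()
def lf_money_words(x):
    return SPAM if any(w in x.text.lower() for w in ("cash", "prize")) else ABSTAIN

@labeling_function()
def lf_short_reply(x):
    # Weak heuristic: very short messages are usually ham.
    return HAM if len(x.text.split()) < 5 else ABSTAIN

df_train = pd.DataFrame({"text": [
    "Win cash now http://spam.example", "ok see you soon",
    "Claim your prize!!!", "lunch tomorrow?", "the report is attached",
]})

# Apply the LFs, then denoise their overlapping, conflicting votes.
L_train = PandasLFApplier([lf_contains_link, lf_money_words, lf_short_reply]).apply(df_train)
label_model = LabelModel(cardinality=2, verbose=False)
label_model.fit(L_train)                     # estimates LF accuracies without ground truth
probs = label_model.predict_proba(L_train)   # probabilistic training labels
```

With realistic data, the probabilistic labels from the label model are then used to train a downstream discriminative model; on a toy set this small the estimates are of course unreliable.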

Snuba: Automating Weak Supervision to Label Training Data

Snuba is a system that automatically generates heuristics from a small labeled dataset to assign training labels to a large, unlabeled dataset in the weak supervision setting; a statistical measure guarantees that the iterative generation process terminates automatically before it degrades training label quality.
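
An illustrative sketch of the Snuba idea, not the actual system: fit one-feature decision stumps on the small labeled set, keep only those whose confident predictions are accurate there, and let them abstain elsewhere. All helper names are hypothetical:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

ABSTAIN = -1

def generate_heuristics(X_small, y_small, threshold=0.8):
    heuristics = []
    for f in range(X_small.shape[1]):
        stump = DecisionTreeClassifier(max_depth=1).fit(X_small[:, [f]], y_small)
        probs = stump.predict_proba(X_small[:, [f]])
        confident = probs.max(axis=1) >= threshold
        if confident.any():
            preds = stump.predict(X_small[:, [f]])
            # Keep the stump only if it is accurate where it is confident.
            if (preds[confident] == y_small[confident]).mean() >= threshold:
                heuristics.append((f, stump, threshold))
    return heuristics

def apply_heuristic(h, X):
    f, stump, thr = h
    probs = stump.predict_proba(X[:, [f]])
    preds = stump.classes_[probs.argmax(axis=1)]
    return np.where(probs.max(axis=1) >= thr, preds, ABSTAIN)  # abstain when unsure

X, y = make_classification(n_samples=500, n_features=8, random_state=0)
hs = generate_heuristics(X[:50], y[:50])     # small labeled set
L = (np.column_stack([apply_heuristic(h, X[50:]) for h in hs])
     if hs else np.empty((450, 0)))          # votes on the unlabeled data
```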

Software 2.0 and Snorkel: Beyond Hand-Labeled Data

Snorkel, a system that enables users to help shape, create, and manage training data for Software 2.0 stacks, is described; estimating and accounting for the quality of the labeling functions in this way can lead to improved training set labels and boost downstream application quality, potentially by large margins.

AutoWS-Bench-101: Benchmarking Automated Weak Supervision with 100 Labels

The central question of AutoWS-Bench-101 is whether a practitioner should use an AutoWS method to generate additional labels or use a simpler baseline, such as zero-shot predictions from a foundation model or supervised learning; its findings suggest that such foundation-model baselines are best combined with AutoWS methods.

Interactive Weak Supervision: Learning Useful Heuristics for Data Labeling

This work develops the first framework for interactive weak supervision in which a method proposes heuristics and learns from user feedback given on each proposed heuristic, demonstrating that only a small number of feedback iterations are needed to train models that achieve highly competitive test set performance without access to ground truth training labels.
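
A minimal sketch of such an interactive loop under simplifying assumptions; all names are hypothetical and the acquisition score is a crude stand-in for the paper's learned model of heuristic usefulness:

```python
import numpy as np

def interactive_ws(candidates, coverage, user_approves, rounds=10):
    prior = np.full(len(candidates), 0.5)      # believed P(useful) per candidate
    accepted, asked = [], set()
    for _ in range(min(rounds, len(candidates))):
        scores = prior * coverage              # favor useful, high-coverage heuristics
        scores[list(asked)] = -np.inf          # never repeat a query
        i = int(np.argmax(scores))
        asked.add(i)
        if user_approves(candidates[i]):       # expert feedback on one heuristic
            accepted.append(candidates[i])
            prior[i] = 1.0
        else:
            prior[i] = 0.0
    return accepted

rng = np.random.default_rng(0)
cands = [f"heuristic_{i}" for i in range(20)]
kept = interactive_ws(cands, coverage=rng.random(20),
                      user_approves=lambda lf: rng.random() < 0.5)  # simulated user
```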

TagRuler: Interactive Tool for Span-Level Data Programming by Demonstration

TagRuler, a novel tool, makes it easy for annotators to build span-level labeling functions without programming and encourages them to explore trade-offs between different labeling models and active learning strategies.

XPASC: Measuring Generalization in Weak Supervision

A novel method, XPASC (eXPlainability-Association SCore), is introduced for measuring the generalization of a model trained with a weakly supervised dataset; it is shown that generalization and performance do not relate one-to-one, and that the highest degree of generalization does not necessarily imply the best performance.

Learning from Multiple Noisy Partial Labelers

This work introduces a probabilistic generative model that can estimate the underlying accuracies of multiple noisy partial labelers without ground truth labels; the resulting approach achieves accuracy comparable to recent embedding-based zero-shot learning methods while using only pre-trained attribute detectors.

Witan: Unsupervised Labelling Function Generation for Assisted Data Programming

This paper proposes Witan, an algorithm for generating labelling functions without any initial supervision, which affords many interaction modes, including unsupervised dataset exploration before the user even defines a set of classes.

Adaptive Rule Discovery for Labeling Text Data

DARWIN, an interactive system designed to alleviate the task of writing rules for labeling text data in weakly supervised settings, is presented; rules discovered by DARWIN identify on average 40% more positive instances than Snuba, even when Snuba is provided with 1000 labeled instances.
...

References

Snuba: Automating Weak Supervision to Label Training Data

Snuba is a system that automatically generates heuristics from a small labeled dataset to assign training labels to a large, unlabeled dataset in the weak supervision setting; a statistical measure guarantees that the iterative generation process terminates automatically before it degrades training label quality.

Data Programming: Creating Large Training Sets, Quickly

A paradigm for the programmatic creation of training sets called data programming is proposed in which users express weak supervision strategies or domain heuristics as labeling functions, which are programs that label subsets of the data, but that are noisy and may conflict.
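
Data programming in miniature, with hypothetical labeling functions for a toy "refund request" classifier; simple majority vote stands in here for the paper's generative label model:

```python
import numpy as np

ABSTAIN = -1

def lf_keyword(x):  return 1 if "refund" in x else ABSTAIN
def lf_short(x):    return 0 if len(x.split()) < 4 else ABSTAIN
def lf_exclaim(x):  return 1 if "!" in x else ABSTAIN

def label_matrix(docs, lfs):
    return np.array([[lf(d) for lf in lfs] for d in docs])

def majority_vote(L, n_classes=2):
    out = np.full(len(L), ABSTAIN)
    for i, row in enumerate(L):
        votes = row[row != ABSTAIN]
        if votes.size:
            counts = np.bincount(votes, minlength=n_classes)
            winners = np.flatnonzero(counts == counts.max())
            if winners.size == 1:        # abstain on ties, as on no votes
                out[i] = winners[0]
    return out

docs = ["please refund my order!", "ok thanks", "refund",
        "meeting moved to noon today"]
L = label_matrix(docs, [lf_keyword, lf_short, lf_exclaim])
print(majority_vote(L))                  # noisy, conflicting votes resolved per example
```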

Snorkel MeTaL: Weak Supervision for Multi-Task Learning

This work proposes Snorkel MeTaL, an end-to-end system for multi-task learning that leverages weak supervision provided at multiple levels of granularity by domain expert users.

Training Complex Models with Multi-Task Weak Supervision

This work shows that, by solving a matrix completion-style problem, the accuracies of multi-task supervision sources can be recovered given their dependency structure but without any labeled data, leading to higher-quality supervision for training an end model.
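
A sketch of the underlying idea in its simplest special case: for conditionally independent binary sources in {-1, +1} and balanced classes, pairwise agreement rates identify each source's accuracy via the classic triplet identity E[l_i l_j] = (2a_i - 1)(2a_j - 1). The paper solves a matrix completion-style generalization; the simulation below is illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20000
y = rng.choice([-1, 1], size=n)                 # latent ground truth, never observed
true_acc = np.array([0.85, 0.70, 0.60])         # P(source agrees with y)
L = np.array([np.where(rng.random(n) < a, y, -y) for a in true_acc])

M = (L @ L.T) / n                               # pairwise agreement moments
c0 = np.sqrt(M[0, 1] * M[0, 2] / M[1, 2])       # |2*a_0 - 1| from triplets alone
print("estimated accuracy of source 0:", (c0 + 1) / 2, "true:", true_acc[0])
```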

Learning the Structure of Generative Models without Labeled Data

This work proposes a structure estimation method that maximizes the ℓ1-regularized marginal pseudolikelihood of the observed data and shows that the amount of unlabeled data required to identify the true structure scales sublinearly in the number of possible dependencies for a broad class of models.
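
A rough, fully observed sketch of the pseudolikelihood idea (the paper additionally handles the latent true label): regress each labeling function's output on the others with an ℓ1 penalty and read dependency edges off the nonzero coefficients. The setup below is hypothetical:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
L = rng.integers(0, 2, size=(2000, 5))               # 5 hypothetical binary LFs
L[:, 1] = L[:, 0] ^ (rng.random(2000) < 0.1)         # LF1 copies LF0 with 10% noise

for j in range(L.shape[1]):
    others = np.delete(L, j, axis=1)
    model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
    model.fit(others, L[:, j])                       # pseudolikelihood regression
    deps = np.flatnonzero(np.abs(model.coef_[0]) > 1e-6)
    print(f"LF{j}: nonzero dependencies at remaining-LF indices {deps}")
```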

Snorkel DryBell: A Case Study in Deploying Weak Supervision at Industrial Scale

A first-of-its-kind study showing how existing knowledge resources from across an organization can be used as weak supervision in order to bring development time and cost down by an order of magnitude, and introducing Snorkel DryBell, a new weak supervision management system for this setting.

Active Learning

The key idea behind active learning is that a machine learning algorithm can perform better with less training if it is allowed to choose the data from which it learns. An active learner may pose queries, usually in the form of unlabeled data instances to be labeled by an oracle (e.g., a human annotator).
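
A minimal uncertainty-sampling sketch of that loop, with a synthetic dataset standing in for real data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, random_state=0)
labeled = list(range(10))                            # small seed set
pool = list(range(10, len(X)))

clf = LogisticRegression(max_iter=1000)
for _ in range(20):
    clf.fit(X[labeled], y[labeled])
    probs = clf.predict_proba(X[pool])
    q = pool[int(np.argmin(probs.max(axis=1)))]      # least-confident pool example
    labeled.append(q)                                # oracle reveals y[q]
    pool.remove(q)
print("labeled set size after querying:", len(labeled))
```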

Label-Free Supervision of Neural Networks with Physics and Domain Knowledge

This work introduces a new approach to supervising neural networks by specifying constraints, derived from prior domain knowledge, that should hold over the output space, rather than providing direct examples of input-output pairs.
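
A minimal sketch of such constraint-based training, modeled on the paper's free-fall example: the loss penalizes height predictions whose second differences deviate from constant gravitational acceleration, with no labels involved. The data here is a random stand-in:

```python
import torch

net = torch.nn.Sequential(torch.nn.Linear(8, 32), torch.nn.ReLU(),
                          torch.nn.Linear(32, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
g, dt = 9.8, 0.1                                    # gravity, frame interval

frames = torch.randn(16, 10, 8)                     # 16 clips x 10 timesteps x 8 features
for _ in range(200):
    h = net(frames).squeeze(-1)                     # predicted height per timestep
    accel = (h[:, 2:] - 2 * h[:, 1:-1] + h[:, :-2]) / dt**2
    loss = ((accel + g) ** 2).mean()                # constraint: acceleration == -g
    opt.zero_grad(); loss.backward(); opt.step()
```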

Combining labeled and unlabeled data with co-training

A PAC-style analysis is provided for a problem setting motivated by the task of learning to classify web pages, in which the description of each example can be partitioned into two distinct views, allowing inexpensive unlabeled data to augment a much smaller set of labeled examples.
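
A compact sketch of co-training under the two-view assumption; the classifier choice, confidence rule, and schedule are arbitrary simplifications of Blum and Mitchell's algorithm:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB

def co_train(X1, X2, y, labeled, pool, rounds=10, per_round=5):
    labeled, pool, y = list(labeled), list(pool), y.copy()
    c1, c2 = GaussianNB(), GaussianNB()
    for _ in range(rounds):
        c1.fit(X1[labeled], y[labeled])
        c2.fit(X2[labeled], y[labeled])
        # Each view's classifier pseudo-labels its most confident pool examples.
        for clf, Xv in ((c1, X1), (c2, X2)):
            if not pool:
                return c1, c2
            conf = clf.predict_proba(Xv[pool]).max(axis=1)
            for t in sorted(np.argsort(conf)[-per_round:], reverse=True):
                i = pool[t]
                y[i] = clf.predict(Xv[i:i + 1])[0]   # pseudo-label; true y[i] unused
                labeled.append(i)
                del pool[t]
    return c1, c2

X, y = make_classification(n_samples=300, n_features=20, random_state=0)
X1, X2 = X[:, :10], X[:, 10:]                        # two "views" of each example
c1, c2 = co_train(X1, X2, y, labeled=range(10), pool=range(10, 300))
```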

Active Learning Literature Survey

This report provides a general introduction to active learning and a survey of the literature, including a discussion of the scenarios in which queries can be formulated, and an overview of the query strategy frameworks proposed in the literature to date.
...