Hybridization of Active Learning and Data Programming for Labeling Large Industrial Datasets

  title={Hybridization of Active Learning and Data Programming for Labeling Large Industrial Datasets},
  author={Mona Nashaat and Aindrila Ghosh and James Miller and Shaikh Quader and Chad Marston and J. Puget},
  journal={2018 IEEE International Conference on Big Data (Big Data)},
Modern machine learning (ML) models are being used heavily in business domains to build effective decision support systems. [...] We use traditional active learning and data programming techniques as baselines to compare the performance and annotation cost of our proposed approach. The results show that the proposed method can achieve higher labeling accuracy than data programming. It can also minimize the labeling cost in real-world business scenarios, while delivering a comparable level of…
Iterative Data Programming for Expanding Text Classification Corpora
This work presents a fast, simple data programming method for augmenting text data sets by generating neighborhood-based weak models with minimal supervision, and employs an iterative procedure to identify sparsely distributed examples from large volumes of unlabeled data.
Active WeaSuL: Improving Weak Supervision with Active Learning
This work modifies the weak supervision loss function so that the expert-labelled data inform and improve the combination of weak labels, and introduces the maxKL divergence sampling strategy, which determines for which data points expert labelling is most beneficial.
Asterisk: Generating Large Training Datasets with Automatic Active Supervision
This work proposes techniques to generate training data with minimal annotation effort using weak supervision and active learning, which helps in reducing the annotation cost while building machine learning models that generalize beyond the training data.
Sampling Approach Matters: Active Learning for Robotic Language Acquisition
It is observed that representativeness, along with diversity, is crucial in selecting data samples, and a method for analyzing the complexity of data in this joint problem space is presented.
Tensor-decomposition-based unsupervised feature extraction applied to prostate cancer multiomics data
TD-based unsupervised FE was demonstrated to be not only the superior feature selection method but also the method that can select biologically reliable genes in this variety of multiomics measurements.
Exploring Inspiration Sets in a Data Programming Pipeline for Product Moderation
We carry out a case study on the use of data programming to create data to train classifiers used for product moderation on a large e-commerce platform. Data programming is a recently-introduced…
M-Lean: An end-to-end development framework for predictive models in B2B scenarios
M-Lean is presented, an end-to-end framework that aims at guiding businesses in designing, developing, evaluating, and deploying business-to-business predictive systems and employs the Lean Startup methodology to maximize business value while eliminating wasteful development practices.
Interactive Weak Supervision: Learning Useful Heuristics for Data Labeling
This work develops the first framework for interactive weak supervision in which a method proposes heuristics and learns from user feedback given on each proposed heuristic, demonstrating that only a small number of feedback iterations are needed to train models that achieve highly competitive test set performance without access to ground truth training labels.


Snorkel: Rapid Training Data Creation with Weak Supervision
Snorkel is a first-of-its-kind system that enables users to train state-of-the-art models without hand labeling any training data and proposes an optimizer for automating tradeoff decisions that gives up to 1.8× speedup per pipeline execution.
Reef : Automating Weak Supervision to Label Training Data
As deep learning models are applied to increasingly diverse and complex problems, a key bottleneck is gathering enough high-quality training labels tailored to each task. Users therefore turn to weak…
Data Programming: Creating Large Training Sets, Quickly
A paradigm for the programmatic creation of training sets called data programming is proposed in which users express weak supervision strategies or domain heuristics as labeling functions, which are programs that label subsets of the data, but that are noisy and may conflict.
Big active learning
This work proposes the APRAL and MLP strategies so that the computation for active learning can be dramatically reduced while keeping the model power more or less the same.
Active Learning Literature Survey
This report provides a general introduction to active learning and a survey of the literature, including a discussion of the scenarios in which queries can be formulated, and an overview of the query strategy frameworks proposed in the literature to date.
Active learning with support vector machines
Different query strategies for selecting informative data points are discussed, and the ways in which these strategies give rise to different variants of active learning with SVMs are reviewed.
Label-and-Learn: Visualizing the Likelihood of Machine Learning Classifier's Success During Data Labeling
Through a Label-and-Learn interface, this paper explores visualization strategies that leverage the data labeling task to enhance developers' knowledge about their dataset, including the likely success of the classifiers and the rationale behind the classifier's decisions.
Learning the Structure of Generative Models without Labeled Data
This work proposes a structure estimation method that maximizes the ℓ1-regularized marginal pseudolikelihood of the observed data and shows that the amount of unlabeled data required to identify the true structure scales sublinearly in the number of possible dependencies for a broad class of models.
Re-Active Learning: Active Learning with Relabeling
This paper shows how traditional active learning methods perform poorly at re-active learning, presents new algorithms designed for this important problem, formally characterizes their behavior, and empirically shows that these methods effectively make this tradeoff.
Active learning for large multi-class problems
This paper introduces a probabilistic variant of the K-nearest neighbor method for classification that can be seamlessly used for active learning in multi-class scenarios and uses this measure of uncertainty to actively sample training examples that maximize discriminating capabilities of the model.