Iterative Data Programming for Expanding Text Classification Corpora

Neil Rohit Mallinar, Abhishek Shah, Tin Kam Ho, Rajendra Ugrani, Ayush Gupta

Real-world text classification tasks often require many labeled training examples that are expensive to obtain. Recent advancements in machine teaching, specifically the data programming paradigm, facilitate the creation of training data sets quickly via a general framework for building weak models, also known as labeling functions, and denoising them through ensemble learning techniques. We present a fast, simple data programming method for augmenting text data sets by generating neighborhood… 

Witan: Unsupervised Labelling Function Generation for Assisted Data Programming

This paper proposes Witan, an algorithm for generating labelling functions without any initial supervision, which affords many interaction modes, including unsupervised dataset exploration before the user even defines a set of classes.

Nemo: Guiding and Contextualizing Weak Supervision for Interactive Data Programming

This paper presents Nemo, an end-to-end interactive system that improves the overall productivity of the weak supervision (WS) learning pipeline by an average of 20% (and up to 47% in one task) compared to the prevailing WS approach.

Conference Abstracts

The jihadisphere is a vehicle for extreme propaganda, radicalization processes, and recruitment methods jihadist organizations tend to resort to (Brown, 20215; Malešević 2017). Using social identity



An Empirical Study of Active Learning for Text Classification

Four learners are compared under two different active learning approaches against random sampling, examining the efficacy of annotating unlabeled documents that verify specific queries and improving the learning behavior of the first case.

Using the Web as corpus for self-training text categorization

A new semi-supervised method for text categorization is proposed, which considers the automatic extraction of unlabeled examples from the Web and the application of an enriched self-training approach for the construction of the classifier.

Data Programming: Creating Large Training Sets, Quickly

A paradigm for the programmatic creation of training sets called data programming is proposed in which users express weak supervision strategies or domain heuristics as labeling functions, which are programs that label subsets of the data, but that are noisy and may conflict.
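
As a minimal sketch of the idea summarized above, the Python below defines a few labeling functions that each label only a subset of the inputs, abstain otherwise, and can disagree with each other. Everything in it is hypothetical and chosen for illustration: the keyword rules, the "billing" and "account" classes, and the function names do not come from the paper. The votes are combined with a plain majority vote, which is a simple stand-in and not the generative denoising model that data programming proposes.

```python
from collections import Counter

ABSTAIN = None  # a labeling function returns this when it has no opinion


# Hypothetical keyword heuristics; each one labels only a subset of inputs.
def lf_refund(text):
    return "billing" if "refund" in text.lower() else ABSTAIN


def lf_charge(text):
    return "billing" if "charge" in text.lower() else ABSTAIN


def lf_password(text):
    return "account" if "password" in text.lower() else ABSTAIN


def lf_account(text):
    return "account" if "account" in text.lower() else ABSTAIN


LABELING_FUNCTIONS = [lf_refund, lf_charge, lf_password, lf_account]


def majority_vote(text, lfs=LABELING_FUNCTIONS):
    """Combine noisy, possibly conflicting votes.

    Assumption: simple majority voting replaces the learned label model.
    Returns ABSTAIN when no function fires or when the top two labels tie.
    """
    votes = [vote for vote in (lf(text) for lf in lfs) if vote is not ABSTAIN]
    if not votes:
        return ABSTAIN
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN  # conflicting functions with equal support
    return ranked[0][0]


if __name__ == "__main__":
    for utterance in [
        "I want a refund for this charge",
        "I forgot my password",
        "Refund to my account",
        "Hello there",
    ]:
        print(repr(utterance), "->", majority_vote(utterance))
```

The third utterance triggers one "billing" rule and one "account" rule, so the majority vote abstains. Conflicts of this kind are what a learned label model is meant to resolve, by estimating how reliable each labeling function is instead of weighting them all equally.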

Hybridization of Active Learning and Data Programming for Labeling Large Industrial Datasets

The results show that the proposed method can achieve higher labeling accuracy than data programming, and can minimize the labeling cost in real-world business scenarios, while delivering a comparable level of performance with active learning.

Learning the Structure of Generative Models without Labeled Data

This work proposes a structure estimation method that maximizes the ℓ1-regularized marginal pseudolikelihood of the observed data and shows that the amount of unlabeled data required to identify the true structure scales sublinearly in the number of possible dependencies for a broad class of models.

Bootstrapping Conversational Agents With Weak Supervision

This paper presents a framework called search, label, and propagate (SLP) for bootstrapping intents from existing chat logs using weak supervision. It reports on a user study that shows positive user feedback for this new approach to building conversational agents and demonstrates the effectiveness of using data programming for auto-labeling.

Document Classification Using Expectation Maximization with Semi Supervised Learning

The main purpose of this paper is to explain how the expectation maximization technique from data mining can be used to classify documents, and how to improve accuracy when using a semi-supervised approach.

Active Learning Literature Survey

This report provides a general introduction to active learning and a survey of the literature, including a discussion of the scenarios in which queries can be formulated, and an overview of the query strategy frameworks proposed in the literature to date.

Snorkel: rapid training data creation with weak supervision

Snorkel is a first-of-its-kind system that enables users to train state-of-the-art models without hand labeling any training data by incorporating the first end-to-end implementation of the recently proposed machine learning paradigm, data programming.

Rapidly Scaling Dialog Systems with Interactive Learning

This paper shows how interactive learning can be applied to the creation of statistical intent models: intent detectors can be built using interactive learning and then improved in a novel end-to-end visualization tool.