Text Classification Using Label Names Only: A Language Model Self-Training Approach

@inproceedings{Meng2020TextCU,
  title={Text Classification Using Label Names Only: A Language Model Self-Training Approach},
  author={Yu Meng and Yunyi Zhang and Jiaxin Huang and Chenyan Xiong and Heng Ji and Chao Zhang and Jiawei Han},
  booktitle={EMNLP},
  year={2020}
}
Current text classification methods typically require a good number of human-labeled documents as training data, which can be costly and difficult to obtain in real applications. Humans can perform classification without seeing any labeled examples but only based on a small set of words describing the categories to be classified. In this paper, we explore the potential of only using the label name of each class to train classification models on unlabeled data, without using any labeled documents.
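
The title names the two ingredients, label names and language model self-training, so the idea can be sketched even from the abstract alone: use a pretrained masked LM to expand each label name into a category vocabulary, pseudo-label unlabeled documents by vocabulary overlap, and let self-training absorb the rest. The sketch below assumes the Hugging Face transformers library; the template sentence, single-token label names, and the overlap-based scoring rule are illustrative simplifications, not the paper's exact procedure (which works over occurrences in the unlabeled corpus itself).

```python
# A hedged sketch of the label-name-only recipe: expand each label name into
# a category vocabulary with a masked LM, then pseudo-label documents by
# vocabulary overlap. Template and scoring rule are illustrative assumptions.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
mlm = AutoModelForMaskedLM.from_pretrained("bert-base-uncased").eval()

def category_vocabulary(label_name: str, top_k: int = 20) -> set:
    # Feed a context containing the (single-token) label name and read the
    # MLM head's top predictions at its position: words the LM considers
    # interchangeable with the label name serve as category-indicative words.
    enc = tokenizer(f"the topic of this document is {label_name}.",
                    return_tensors="pt")
    pos = (enc.input_ids[0] ==
           tokenizer.convert_tokens_to_ids(label_name)).nonzero().item()
    with torch.no_grad():
        logits = mlm(**enc).logits[0, pos]
    ids = logits.topk(top_k).indices.tolist()
    return {tokenizer.convert_ids_to_tokens(i) for i in ids} | {label_name}

def pseudo_label(doc: str, vocabs: dict):
    # Assign the class whose vocabulary overlaps the document most; documents
    # matching no vocabulary stay unlabeled and would only be absorbed later
    # by self-training on the model's own confident predictions.
    words = set(doc.lower().split())
    scores = {label: len(words & vocab) for label, vocab in vocabs.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

vocabs = {name: category_vocabulary(name)
          for name in ("sports", "politics", "business")}
print(pseudo_label("the team won the championship game in overtime",
                   vocabs))  # a class name, or None if no vocabulary matches
```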

Citations

TaxoClass: Hierarchical Multi-Label Text Classification Using Only Class Names

TLDR
This paper proposes a novel HMTC framework, named TaxoClass, which calculates document-class similarities using a textual entailment model, identifies a document’s core classes and utilizes confident core classes to train a taxonomy-enhanced classifier, and generalizes the classifier via multi-label self-training.
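
The entailment step has a readily available analogue: the generic zero-shot classification pipeline in Hugging Face transformers scores document-class pairs with an NLI model. The checkpoint and hypothesis template below are common defaults, not TaxoClass's exact setup, and the taxonomy-aware training and self-training stages are omitted.

```python
# Document-class similarity via textual entailment, sketched with the generic
# zero-shot pipeline from `transformers` (not TaxoClass's own code).
from transformers import pipeline

classifier = pipeline("zero-shot-classification",
                      model="facebook/bart-large-mnli")
result = classifier(
    "The Fed raised interest rates by a quarter point on Wednesday.",
    candidate_labels=["economy", "sports", "science"],
    hypothesis_template="This document is about {}.",  # entailment hypothesis
    multi_label=True,  # independent per-class scores, as in multi-label HMTC
)
print(list(zip(result["labels"], result["scores"])))
```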

Using Pseudo-Labelled Data for Zero-Shot Text Classification

TLDR
This paper proposes an approach called P-ZSC to leverage pseudo-labelled data for zero-shot text classification, via a matching algorithm between the unlabelled target-domain corpus and label vocabularies that consist of in-domain relevant phrases expanded from the label names.

X-Class: Text Classification with Extremely Weak Supervision

TLDR
This paper proposes a novel framework, X-Class, which obtains document representations via a weighted average of contextualized word representations and can rival and even outperform seed-driven weakly supervised methods on 7 benchmark datasets.
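
That document representation is straightforward to sketch: a weighted average of contextualized token vectors from a pretrained encoder. X-Class derives class-oriented weights; the version below leaves the weighting pluggable and defaults to uniform mean pooling, which is an assumption, not the paper's exact rule.

```python
# Weighted average of contextualized word representations as a document
# vector; the uniform default stands in for X-Class's class-oriented weights.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased").eval()

def document_representation(doc: str,
                            weights: torch.Tensor = None) -> torch.Tensor:
    enc = tokenizer(doc, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = encoder(**enc).last_hidden_state[0]   # (seq_len, dim)
    if weights is None:                                # uniform = mean pooling
        weights = torch.ones(hidden.size(0))
    weights = weights / weights.sum()                  # normalize to sum to 1
    return (weights.unsqueeze(1) * hidden).sum(dim=0)  # (dim,)

vec = document_representation("stocks rallied after the earnings report")
print(vec.shape)  # torch.Size([768])
```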

Weakly Supervised Text Classification using Supervision Signals from a Language Model

TLDR
A latent variable model is proposed to simultaneously learn a word-distribution learner, which associates generated words with pre-defined categories, and a document classifier, without using any annotated data.

MotifClass: Weakly Supervised Text Classification with Higher-order Metadata Information

TLDR
A novel framework, MotifClass, retrieves and generates pseudo-labeled training samples based on category names and indicative motif instances, then trains a text classifier on the pseudo training data; the benefit of considering higher-order metadata information is demonstrated.

Seed Word Selection for Weakly-Supervised Text Classification with Unsupervised Error Estimation

TLDR
A comprehensive evaluation on six binary classification tasks over four popular datasets demonstrates that the proposed method outperforms a baseline using only category-name seed words and obtains performance comparable to a counterpart using expert-annotated seed words.

Metadata-Induced Contrastive Learning for Zero-Shot Multi-Label Text Classification

TLDR
Experimental results show that MICoL significantly outperforms strong zero-shot text classification and contrastive learning baselines, is on par with the state-of-the-art supervised metadata-aware LMTC method trained on 10K–200K labeled documents, and tends to predict more infrequent labels than supervised methods, thus alleviating the degraded performance on long-tailed labels.
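
The contrastive component reads like a standard InfoNCE objective over metadata-induced document pairs. A generic in-batch version is sketched below; the pairing itself, which MICoL derives from document metadata, is abstracted away here as pre-aligned rows.

```python
# Generic in-batch InfoNCE loss; in MICoL the positive pairs would be induced
# by shared document metadata, assumed here to arrive as pre-aligned rows.
import torch
import torch.nn.functional as F

def info_nce(anchors: torch.Tensor, positives: torch.Tensor,
             temperature: float = 0.07) -> torch.Tensor:
    a = F.normalize(anchors, dim=1)
    p = F.normalize(positives, dim=1)
    logits = a @ p.T / temperature           # (batch, batch) cosine similarities
    targets = torch.arange(a.size(0))        # row i's positive is column i;
    return F.cross_entropy(logits, targets)  # other columns act as negatives

loss = info_nce(torch.randn(8, 768), torch.randn(8, 768))
print(loss.item())
```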

Weakly Supervised Prototype Topic Model with Discriminative Seed Words: Modifying the Category Prior by Self-exploring Supervised Signals

TLDR
This paper proposes a novel generative dataless method, the Weakly Supervised Prototype Topic Model (WSPTM), which modifies the category prior by self-exploring supervised signals from discriminative seed words and outperforms the existing baseline methods.

Contrast-Enhanced Semi-supervised Text Classification with Few Labels

TLDR
A certainty-driven sample selection method and a contrast-enhanced similarity graph are proposed to utilize data more efficiently in self-training, alleviating the problem of annotation scarcity.

Towards Open-Domain Topic Classification

TLDR
An open-domain topic classification system is presented that accepts user-defined taxonomies in real time; it significantly improves over existing zero-shot baselines in open-domain scenarios and performs competitively with weakly-supervised models trained on in-domain data.
...

References

SHOWING 1-10 OF 51 REFERENCES

All-in Text: Learning Document, Label, and Word Representations Jointly

TLDR
The method's potential is demonstrated on the multi-label classification task of assigning keywords from the Medical Subject Headings to publications in biomedical research, in both a conventional and a zero-shot learning setting.

Importance of Semantic Representation: Dataless Classification

TLDR
This paper introduces Dataless Classification, a learning protocol that uses world knowledge to induce classifiers without the need for any labeled data; it proposes a model for dataless classification and shows that the label name alone is often sufficient to induce classifiers.

Dataless Text Classification with Descriptive LDA

TLDR
A novel kind of model, descriptive LDA (DescLDA), is proposed, which performs dataless text classification (DLTC) with only category description words and unlabeled documents; it is more effective than the semantics-based DLTC baseline, and its accuracy is very close to that of state-of-the-art supervised text classification methods.

Train Once, Test Anywhere: Zero-Shot Learning for Text Classification

TLDR
This work proposes a zero-shot learning approach for text categorization that trains a model on a large corpus of sentences to learn the relationship between a sentence and the embedding of its tags, and shows that the models generalize well across new, unseen classes.

Weakly-Supervised Hierarchical Text Classification

TLDR
This paper proposes a weakly-supervised neural method for hierarchical text classification that features a hierarchical neural structure, which mimics the given hierarchy and is capable of determining the proper levels for documents with a blocking mechanism.

Contextualized Weak Supervision for Text Classification

TLDR
A novel framework, ConWea, is proposed that provides contextualized weak supervision for text classification: it leverages contextualized representations of word occurrences and seed word information to automatically differentiate multiple interpretations of the same word, thus creating a contextualized corpus.
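
ConWea's key move, separating the interpretations of a seed word by its contexts, can be sketched by clustering the contextualized vectors of the word's occurrences. K-means with two clusters below is an illustrative choice, not the paper's actual disambiguation procedure.

```python
# Split a seed word's occurrences into interpretations by clustering their
# contextualized vectors (k=2 here is an illustrative assumption).
import torch
from sklearn.cluster import KMeans
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased").eval()

def occurrence_vectors(word: str, sentences: list) -> torch.Tensor:
    # One contextualized vector per occurrence of `word` (single-token words).
    word_id = tokenizer.convert_tokens_to_ids(word)
    vecs = []
    for s in sentences:
        enc = tokenizer(s, return_tensors="pt", truncation=True)
        with torch.no_grad():
            hidden = encoder(**enc).last_hidden_state[0]
        for pos in (enc.input_ids[0] == word_id).nonzero().flatten().tolist():
            vecs.append(hidden[pos])
    return torch.stack(vecs)

sentences = [
    "the court ruled the law unconstitutional",  # legal sense
    "the players warmed up on the court",        # sports sense
    "she appealed the decision to a higher court",
    "a tennis court was built behind the school",
]
vecs = occurrence_vectors("court", sentences)
labels = KMeans(n_clusters=2, n_init=10).fit_predict(vecs.numpy())
print(labels)  # occurrences grouped by interpretation
```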

Minimally Supervised Categorization of Text with Metadata

TLDR
MetaCat is proposed, a minimally supervised framework for categorizing text with metadata; it develops a generative process describing the relationships among words, documents, labels, and metadata, and embeds text and metadata into the same semantic space to encode heterogeneous signals.

On Dataless Hierarchical Text Classification

TLDR
The results show how to improve dataless classification using bootstrapping, and that bootstrapped dataless classification is competitive with supervised classification trained on thousands of labeled examples.

Learning Word Vectors for Sentiment Analysis

TLDR
This work presents a model that uses a mix of unsupervised and supervised techniques to learn word vectors capturing semantic term-document information as well as rich sentiment content, and finds that it outperforms several previously introduced methods for sentiment classification.

Discriminative Topic Mining via Category-Name Guided Text Embedding

TLDR
CatE is developed, a novel category-name-guided text embedding method for discriminative topic mining, which effectively leverages minimal user guidance to learn a discriminative embedding space and discovers category-representative terms in an iterative manner.
...