Efficient Path Prediction for Semi-Supervised and Weakly Supervised Hierarchical Text Classification

Huiru Xiao, Xin Liu, Yangqiu Song
The World Wide Web Conference
Hierarchical text classification has many real-world applications. However, labeling a large number of documents is costly. In practice, we can use semi-supervised learning or weakly supervised learning (e.g., dataless classification) to reduce the labeling cost. In this paper, we propose a path cost-sensitive learning algorithm to utilize the structural information and further make use of unlabeled and weakly-labeled data. We use a generative model to leverage the large amount of unlabeled… 

Hierarchical Metadata-Aware Document Categorization under Weak Supervision
This paper proposes a novel joint representation learning module that allows simultaneous modeling of category dependencies, metadata information and textual semantics, and introduces a data augmentation module that hierarchically synthesizes training documents to complement the original, small-scale training set.
TaxoClass: Hierarchical Multi-Label Text Classification Using Only Class Names
This paper proposes a novel HMTC framework, named TaxoClass, which calculates document-class similarities using a textual entailment model, identifies a document’s core classes and utilizes confident core classes to train a taxonomy-enhanced classifier, and generalizes the classifier via multi-label self-training.
Minimally Supervised Categorization of Text with Metadata
MetaCat, a minimally supervised framework to categorize text with metadata, is proposed; it develops a generative process describing the relationships between words, documents, labels, and metadata, and embeds text and metadata into the same semantic space to encode heterogeneous signals.
Bag-of-Words vs. Sequence vs. Graph vs. Hierarchy for Single- and Multi-Label Text Classification
It is shown that a simple multi-layer perceptron (MLP) using a Bag of Words (BoW) outperforms the recent graph-based models TextGCN and HeteGCN in an inductive text classification setting and is comparable with HyperGAT in single-label classification.
Bag-of-Words vs. Graph vs. Sequence in Text Classification: Questioning the Necessity of Text-Graphs and the Surprising Strength of a Wide MLP
It is shown that a wide multi-layer perceptron (MLP) using a Bag-of-Words (BoW) outperforms the recent graph-based models TextGCN and HeteGCN in an inductive text classification setting and is comparable with HyperGAT.
HiGitClass: Keyword-Driven Hierarchical Classification of GitHub Repositories
The HiGitClass framework is proposed, comprising three modules: heterogeneous information network embedding; keyword enrichment; and topic modeling and pseudo document generation. It is superior to existing weakly-supervised and dataless hierarchical classification methods, especially in its ability to integrate both structured and unstructured data for repository classification.
Forget me not: A Gentle Reminder to Mind the Simple Multi-Layer Perceptron Baseline for Text Classification
It is shown that already a simple MLP baseline achieves comparable performance on benchmark datasets, questioning the importance of synthetic graph structures and providing recommendations for the design and training of such a baseline.


Weakly-Supervised Hierarchical Text Classification
This paper proposes a weakly-supervised neural method for hierarchical text classification that features a hierarchical neural structure, which mimics the given hierarchy and is capable of determining the proper levels for documents with a blocking mechanism.
Weakly-Supervised Neural Text Classification
This paper proposes a weakly-supervised method that addresses the lack of training data in neural text classification; it achieves inspiring performance without requiring excessive training data and significantly outperforms baseline methods.
Semi-supervised Text Classification Using Partitioned EM
This paper proposes a clustering based partitioning technique that first partitions the training documents in a hierarchical fashion using hard clustering, and prunes the tree using the labeled data after running the expectation maximization algorithm in each partition.
Large-Scale Hierarchical Text Classification with Recursively Regularized Deep Graph-CNN
A graph-CNN based deep learning model is proposed to first convert texts to graph-of-words, and then use graph convolution operations to convolve the word graph and regularize the deep architecture with the dependency among labels.
Recursive regularization for large-scale classification with hierarchical and graphical dependencies
This paper proposes a regularization framework for large-scale hierarchical classification that incorporates the hierarchical dependencies between the class-labels into the regularization structure of the parameters thereby encouraging classes nearby in the hierarchy to share similar model parameters.
Importance of Semantic Representation: Dataless Classification
This paper introduces Dataless Classification, a learning protocol that uses world knowledge to induce classifiers without the need for any labeled data, proposes a model for dataless classification, and shows that the label name alone is often sufficient to induce classifiers.
HierCost: Improving Large Scale Hierarchical Classification with Cost Sensitive Learning
This work adopts a cost-sensitive classification approach to the hierarchical classification problem by defining misclassification cost based on the hierarchy, which effectively decouples the models for various classes and allows effective models for large hierarchies to be trained efficiently in a distributed fashion.
Hierarchical Semi-supervised Classification with Incomplete Class Hierarchies
This paper builds exploratory learning methods for hierarchical classification tasks, using subsets of the NELL ontology together with text and HTML table datasets derived from the ClueWeb09 corpus; the proposed approach outperforms the existing Exploratory EM method and its naive extension in seed class F1 by 10% and 7% on average, respectively.
Semi-Supervised Text Classification Using EM
Deterministic annealing, a variant of EM, can help overcome the problem of local maxima and increase classification accuracy further when the generative model is appropriate.
Generalized Expectation Criteria for Semi-Supervised Learning of Conditional Random Fields
This paper presents a semi-supervised training method for linear-chain conditional random fields that makes use of labeled features rather than labeled instances. This is accomplished by using