Corpus ID: 12505054

Subset Labeled LDA for Large-Scale Multi-Label Classification

Yannis Papanikolaou and Grigorios Tsoumakas

Labeled Latent Dirichlet Allocation (LLDA) is an extension of the standard unsupervised Latent Dirichlet Allocation (LDA) algorithm that addresses multi-label learning tasks. Previous work has shown it to perform on par with other state-of-the-art multi-label methods. Nonetheless, with increasing label set sizes, LLDA encounters scalability issues. In this work, we introduce Subset LLDA, a simple variant of the standard LLDA algorithm, that not only can effectively scale up to problems with…
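The core mechanism of LLDA can be illustrated with a short sketch: collapsed Gibbs sampling in which each document's topic assignments are constrained to its own label set. The toy corpus, hyperparameters, and variable names below are illustrative assumptions, not the paper's implementation.

```python
import random
from collections import defaultdict

random.seed(0)

# Each document: (tokens, set of label/topic ids it is annotated with).
docs = [(["win", "goal", "team"], {0}),
        (["vote", "law", "team"], {1}),
        (["goal", "vote"], {0, 1})]
K, V = 2, 5                # number of labels/topics, vocabulary size
alpha, beta = 0.5, 0.01    # Dirichlet hyperparameters

n_kw = defaultdict(int)    # topic-word counts
n_k = defaultdict(int)     # per-topic token totals
n_dk = defaultdict(int)    # document-topic counts
z = {}                     # current topic of each (doc, position)

# Random initialization, restricted to each document's labels.
for d, (words, labels) in enumerate(docs):
    for i, w in enumerate(words):
        k = random.choice(sorted(labels))
        z[d, i] = k
        n_kw[k, w] += 1; n_k[k] += 1; n_dk[d, k] += 1

for _ in range(200):       # Gibbs sweeps
    for d, (words, labels) in enumerate(docs):
        for i, w in enumerate(words):
            k = z[d, i]    # remove the token's current assignment
            n_kw[k, w] -= 1; n_k[k] -= 1; n_dk[d, k] -= 1
            # The LLDA constraint: resample only among this doc's labels.
            cand = sorted(labels)
            weights = [(n_dk[d, t] + alpha)
                       * (n_kw[t, w] + beta) / (n_k[t] + V * beta)
                       for t in cand]
            k = random.choices(cand, weights=weights)[0]
            z[d, i] = k
            n_kw[k, w] += 1; n_k[k] += 1; n_dk[d, k] += 1
```

The scalability issue the paper targets is visible here: for a document annotated with many labels, `cand` grows and each token's resampling step grows with it.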


Data scarcity, robustness and extreme multi-label classification

It is shown that minimizing the Hamming loss with appropriate regularization surpasses many state-of-the-art methods for tail-label detection in XMC, and the spectral properties of label graphs are investigated to provide novel insights into the conditions governing the performance of the Hamming-loss-based one-vs-rest scheme.

Adversarial Extreme Multi-label Classification

This work poses the learning task in extreme classification with a large number of tail labels as learning in the presence of adversarial perturbations, motivating a robust optimization framework and an equivalence to a corresponding regularized objective.

Event Geoparser with Pseudo-Location Entity Identification and Numerical Extraction in Indonesian News Corpus

A novel event geoparser is proposed that integrates an ACE-based event extraction model, provides precise event-level scope resolution, extracts various numerical arguments, and is able to generate a thematic choropleth map from a single news story.

Event Geoparser with Pseudo-Location Entity Identification and Numerical Argument Extraction Implementation and Evaluation in Indonesian News Domain

An event geoparser model with three processing stages is proposed, which tightly integrates an event extraction model into geoparsing and provides precise event-level scope resolution.

DiSMEC: Distributed Sparse Machines for Extreme Multi-label Classification

This work presents DiSMEC, a large-scale distributed framework for learning one-versus-rest linear classifiers coupled with explicit capacity control of model size, and conducts an extensive empirical evaluation on publicly available real-world datasets with up to 670,000 labels.

Random k-Labelsets for Multilabel Classification

Empirical evidence indicates that RAkEL improves substantially over LP, especially in domains with a large number of labels, and exhibits competitive performance against other high-performing multilabel learning methods.
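The RAkEL scheme summarized above can be sketched in a few lines: draw several random size-k label subsets, train one label-powerset (LP) classifier per subset, and threshold the per-label vote ratio at prediction time. The tiny dataset and the memorizing "LP" stub below are illustrative assumptions, not the paper's experimental setup.

```python
import itertools
import random
from collections import Counter, defaultdict

random.seed(1)
L = [0, 1, 2, 3]    # label space
k, m = 2, 4         # labelset size, number of labelsets

# Toy training data: feature value -> set of relevant labels.
train = [("a", {0, 1}), ("a", {0}), ("b", {2, 3}), ("b", {3})]

def fit_lp(data, subset):
    """Label-powerset stub: memorize the most common label combination
    (restricted to `subset`) seen for each feature value."""
    seen = defaultdict(Counter)
    for x, ys in data:
        seen[x][frozenset(ys & subset)] += 1
    return {x: c.most_common(1)[0][0] for x, c in seen.items()}

subsets = random.sample(list(itertools.combinations(L, k)), m)
models = [(set(s), fit_lp(train, set(s))) for s in subsets]

def predict(x, threshold=0.5):
    """Each LP model votes for the labels in its predicted combination;
    a label is output when its vote ratio exceeds the threshold."""
    votes, seen = Counter(), Counter()
    for subset, model in models:
        for lbl in subset:
            seen[lbl] += 1
        for lbl in model.get(x, frozenset()):
            votes[lbl] += 1
    return {lbl for lbl in L if seen[lbl] and votes[lbl] / seen[lbl] > threshold}
```

Each LP model only ever sees 2^k label combinations, which is how RAkEL sidesteps LP's exponential blow-up over the full label space.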

Statistical topic models for multi-label document classification

The experimental results indicate that probabilistic generative models can achieve competitive multi-label classification performance compared to discriminative methods, and have advantages for datasets with many labels and skewed label frequencies.

FastXML: a fast, accurate and stable tree-classifier for extreme multi-label learning

The objective of this paper is to develop an extreme multi-label classifier that is faster to train and more accurate at prediction than the state-of-the-art Multi-label Random Forest algorithm and the Label Partitioning for Sub-linear Ranking algorithm.

Large-scale Multi-label Learning with Missing Labels

This paper studies the multi-label problem in a generic empirical risk minimization (ERM) framework and develops techniques that exploit the structure of specific loss functions - such as the squared loss function - to obtain efficient algorithms.

Sparse Local Embeddings for Extreme Multi-label Classification

The SLEEC classifier is developed for learning a small ensemble of local distance-preserving embeddings which can accurately predict infrequently occurring (tail) labels, and can make significantly more accurate predictions than state-of-the-art methods, including both embedding-based and tree-based methods.

On the Optimality of Classifier Chain for Multi-label Classification

This work first generalizes the CC model over a random label order, then presents a theoretical analysis of the generalization error for the proposed generalized model, and proposes a dynamic-programming-based classifier chain (CC-DP) algorithm to search for the globally optimal label order for CC, as well as a greedy classifier chain (CC-Greedy) algorithm to find a locally optimal CC.

Extreme Multi-label Loss Functions for Recommendation, Tagging, Ranking & Other Missing Label Applications

The choice of the loss function is critical in extreme multi-label learning, where the objective is to annotate each data point with the most relevant subset of labels from an extremely large label set.

Classifier chains for multi-label classification

This paper presents a novel classifier chains method that can model label correlations while maintaining acceptable computational complexity, and illustrates the competitiveness of the chaining method against related and state-of-the-art methods, both in terms of predictive performance and time complexity.

Dense Distributions from Sparse Samples: Improved Gibbs Sampling Parameter Estimators for LDA

We introduce a novel approach for estimating Latent Dirichlet Allocation (LDA) parameters from collapsed Gibbs samples (CGS), by leveraging the full conditional distributions over the latent variable assignments.
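The contrast this abstract draws can be sketched on toy numbers: the standard CGS point estimate of the topic-word matrix counts each token's single hard topic assignment, while the proposed estimator spreads each token over topics according to its full conditional distribution. All numbers below are made up, and the hard assignment is taken as the argmax of each conditional purely for a deterministic illustration (in CGS it is sampled).

```python
K, V = 2, 3        # topics, vocabulary size
beta = 0.01        # Dirichlet smoothing on topic-word distributions

# Toy per-token full conditionals p(z = k | rest) from one Gibbs state.
tokens = [(0, [0.9, 0.1]),   # (word id, conditional over the K topics)
          (1, [0.2, 0.8]),
          (0, [0.6, 0.4])]

# Standard estimator: count one hard topic per token.
hard = [[0.0] * V for _ in range(K)]
for w, p in tokens:
    hard[max(range(K), key=lambda k: p[k])][w] += 1

# Dense estimator: spread each token over topics by its full conditional.
soft = [[0.0] * V for _ in range(K)]
for w, p in tokens:
    for k in range(K):
        soft[k][w] += p[k]

def phi(counts):
    """Smoothed topic-word distributions from either count matrix."""
    return [[(counts[k][w] + beta) / (sum(counts[k]) + V * beta)
             for w in range(V)] for k in range(K)]
```

With few samples, the dense counts give every (topic, word) pair some mass wherever its conditional is nonzero, whereas the hard counts stay sparse; that is the sparse-samples-to-dense-distributions effect the title refers to.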