Heterogeneous document embeddings for cross-lingual text classification

@inproceedings{Moreo2021HeterogeneousDE,
  title={Heterogeneous document embeddings for cross-lingual text classification},
  author={Alejandro Moreo and Andrea Pedrotti and Fabrizio Sebastiani},
  booktitle={Proceedings of the 36th Annual ACM Symposium on Applied Computing},
  year={2021}
}
Funnelling (Fun) is a method for cross-lingual text classification (CLC) based on a two-tier ensemble for heterogeneous transfer learning. In Fun, 1st-tier classifiers, each working on a different, language-dependent feature space, return a vector of calibrated posterior probabilities (with one dimension for each class) for each document, and the final classification decision is taken by a meta-classifier that uses this vector as its input. The meta-classifier can thus exploit class-class…
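To make the two-tier architecture concrete, here is a minimal single-label sketch of Fun, assuming scikit-learn; the TF-IDF features, classifier choices, and the class name are illustrative rather than the authors' implementation, and the paper's multilabel setting is omitted for brevity.

```python
# Minimal sketch of the Funnelling (Fun) two-tier ensemble, assuming
# scikit-learn; feature extraction and classifier choices are illustrative.
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

class Funnelling:
    def __init__(self):
        self.vectorizers = {}  # one language-dependent feature space per language
        self.first_tier = {}   # one calibrated 1st-tier classifier per language
        self.meta = None       # meta-classifier over posterior-probability vectors

    def fit(self, docs_by_lang, labels_by_lang):
        Z, y = [], []
        for lang, docs in docs_by_lang.items():
            vec = TfidfVectorizer(sublinear_tf=True)
            X = vec.fit_transform(docs)
            # 1st tier: calibrated so that outputs are posterior probabilities,
            # one dimension per class (the shared, language-independent space)
            clf = CalibratedClassifierCV(LinearSVC()).fit(X, labels_by_lang[lang])
            self.vectorizers[lang], self.first_tier[lang] = vec, clf
            # the paper derives training posteriors more carefully (via
            # cross-validation); this sketch simply reuses the fitted model
            Z.append(clf.predict_proba(X))
            y.extend(labels_by_lang[lang])
        # 2nd tier: a single meta-classifier trained on all languages at once,
        # which is where class-class correlations can be exploited
        self.meta = LinearSVC().fit(np.vstack(Z), y)
        return self

    def predict(self, docs, lang):
        X = self.vectorizers[lang].transform(docs)
        return self.meta.predict(self.first_tier[lang].predict_proba(X))
```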

Citations

Generalized Funnelling: Ensemble Learning and Heterogeneous Document Embeddings for Cross-Lingual Text Classification

An instance of gFun is described: a generalisation of Fun consisting of a heterogeneous-transfer-learning architecture in which the 1st-tier components can be arbitrary view-generating functions, i.e., language-dependent functions that each produce a language-independent representation of the (monolingual) document.
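A rough sketch of the view-generating function (VGF) abstraction, under the assumption that any fit/transform pair mapping monolingual documents to a language-independent matrix qualifies; the interface and the aggregation by concatenation are illustrative, not the authors' API.

```python
# Hedged sketch of gFun's view-generating functions (VGFs); the Protocol and
# the concatenation-based aggregation are illustrative assumptions.
from typing import List, Protocol
import numpy as np

class ViewGeneratingFunction(Protocol):
    def fit(self, docs: List[str], labels, lang: str) -> "ViewGeneratingFunction": ...
    def transform(self, docs: List[str], lang: str) -> np.ndarray: ...

def aggregate_views(views: List[np.ndarray]) -> np.ndarray:
    # the language-independent views produced by the VGFs are combined
    # (here simply by concatenation) before reaching the meta-classifier
    return np.hstack(views)
```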

Word-class embeddings for multiclass text classification

This work proposes (supervised) word-class embeddings (WCEs) and shows that, when concatenated to (unsupervised) pre-trained word embeddings, they substantially facilitate the training of deep-learning models in multiclass classification by topic.
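As a rough illustration of the idea, the sketch below computes one simple supervised representation in which each word is described by its conditional class frequencies; the correlation measure actually used in the paper may differ.

```python
# Hedged sketch of word-class embeddings (WCEs): each word is represented by
# its correlation with each class, here a plain conditional-frequency estimate.
import numpy as np

def word_class_embeddings(X, Y):
    """X: (n_docs, n_words) binary term occurrence; Y: (n_docs, n_classes) binary labels.
    Returns (n_words, n_classes), where entry (w, c) estimates P(c | w)."""
    counts = X.T @ Y                       # co-occurrences of word w with class c
    totals = X.sum(axis=0).reshape(-1, 1)  # number of documents containing word w
    return counts / np.maximum(totals, 1)

# WCEs are then concatenated to unsupervised pre-trained embeddings, e.g.:
# E = np.hstack([pretrained_embeddings, word_class_embeddings(X, Y)])
```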

D3.1 Initial Outcomes of New Learning Paradigms

This document presents the initial outcomes of the research on new learning paradigms in WP3 of AI4Media. As such, the document summarizes the research advances of the contributing partners in tasks…

References

Showing 1–10 of 72 references

Funnelling: A New Ensemble Method for Heterogeneous Transfer Learning and its Application to Polylingual Text Classification

This work tackles multilabel CLC via funnelling, a new ensemble learning method, and presents substantial experiments, run on publicly available multilingual text collections, in which funnelling is shown to significantly outperform a number of state-of-the-art baselines.

Word-class embeddings for multiclass text classification

This work proposes (supervised) word-class embeddings (WCEs) and shows that, when concatenated to (unsupervised) pre-trained word embeddings, they substantially facilitate the training of deep-learning models in multiclass classification by topic.

Joint Embedding of Words and Labels for Text Classification

This work proposes to view text classification as a label-word joint embedding problem, in which each label is embedded in the same space as the word vectors, and introduces an attention framework that measures the compatibility between the embeddings of text sequences and labels.
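The attention mechanism could look roughly like the sketch below, which scores every word by its cosine compatibility with its most compatible label and returns an attention-weighted document vector; the shapes and the max-pooling choice are assumptions for illustration.

```python
# Hedged sketch of label-word joint embedding with compatibility attention.
import numpy as np

def label_attentive_doc(word_vecs, label_vecs):
    """word_vecs: (seq_len, d); label_vecs: (n_classes, d), same embedding space."""
    W = word_vecs / np.linalg.norm(word_vecs, axis=1, keepdims=True)
    C = label_vecs / np.linalg.norm(label_vecs, axis=1, keepdims=True)
    G = W @ C.T                                   # cosine compatibility (seq_len, n_classes)
    scores = G.max(axis=1)                        # best-matching label per word
    attn = np.exp(scores) / np.exp(scores).sum()  # softmax over the sequence
    return attn @ word_vecs                       # attention-weighted average (d,)
```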

Inferring a Semantic Representation of Text via Cross-Language Correlation Analysis

The problem of learning a semantic representation of a text document from data is addressed, in the situation where a corpus of unlabeled paired documents is available, each pair being formed by a short document and its translation into another language.
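In sketch form, the idea can be approximated with linear CCA over the two monolingual views of the paired corpus (the paper uses a kernel variant); the scikit-learn usage and feature choices below are assumptions.

```python
# Hedged sketch of cross-language correlation analysis over paired documents;
# linear CCA stands in for the kernel CCA used in the paper.
from sklearn.cross_decomposition import CCA
from sklearn.feature_extraction.text import TfidfVectorizer

def fit_cross_lingual_cca(docs_l1, docs_l2, n_components=50):
    # the two views: a document and its translation, in separate feature spaces
    X1 = TfidfVectorizer(max_features=2000).fit_transform(docs_l1).toarray()
    X2 = TfidfVectorizer(max_features=2000).fit_transform(docs_l2).toarray()
    cca = CCA(n_components=n_components).fit(X1, X2)
    Z1, Z2 = cca.transform(X1, X2)  # maximally correlated, language-independent projections
    return cca, Z1, Z2
```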

Cross-Lingual Text Classification with Minimal Resources by Transferring a Sparse Teacher

This work proposes a cross-lingual teacher-student method, CLTS, that generates “weak” supervision in the target language using minimal cross-lingual resources, in the form of a small number of word translations.
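The transfer step might look roughly like this sketch, in which the highest-magnitude weights of a sparse source-language teacher are mapped through a small translation table to seed a weak target-language classifier; the top-k sparsification and all names are illustrative assumptions.

```python
# Hedged sketch of transferring a sparse teacher across languages, as in the
# CLTS idea; the concrete sparsification and data structures are assumptions.
import numpy as np

def transfer_sparse_teacher(W_src, vocab_src, vocab_tgt, translations, k=50):
    """W_src: (n_src_words, n_classes) teacher weights; translations: src -> tgt word."""
    tgt_index = {w: i for i, w in enumerate(vocab_tgt)}
    W_tgt = np.zeros((len(vocab_tgt), W_src.shape[1]))
    for c in range(W_src.shape[1]):
        # keep only the k highest-magnitude source weights per class (sparsity)
        for i in np.argsort(-np.abs(W_src[:, c]))[:k]:
            tgt_word = translations.get(vocab_src[i])
            if tgt_word in tgt_index:
                W_tgt[tgt_index[tgt_word], c] = W_src[i, c]
    return W_tgt  # weights of a weak target-language teacher
```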

Multilingual and cross-lingual document classification: A meta-learning approach

This work proposes a simple yet effective adjustment to existing meta-learning methods that allows for better and more stable learning, and sets a new state of the art on a number of languages while performing on par on others, using only a small amount of labeled data.

Lightweight Random Indexing for Polylingual Text Classification

Random Indexing (RI) is shown to outperform, both in terms of effectiveness and efficiency, a number of previously proposed machine-translation-free and dictionary-free polylingual text classification (PLTC) methods that are used as baselines.
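As a sketch of the underlying mechanism: every term, from whatever language, receives a sparse random index vector, and a document becomes the sum of the index vectors of its terms, so all languages land in one shared low-dimensional space; the dimensionality and sparsity values below are illustrative.

```python
# Hedged sketch of Random Indexing (RI); parameter values are illustrative.
import numpy as np

def make_index_vector(dim=512, nonzeros=4, rng=None):
    rng = rng or np.random.default_rng()
    v = np.zeros(dim)
    pos = rng.choice(dim, size=nonzeros, replace=False)
    v[pos] = rng.choice([-1.0, 1.0], size=nonzeros)  # sparse random +/-1 entries
    return v

def embed_document(tokens, index_vectors, dim=512, rng=None):
    rng = rng or np.random.default_rng(0)
    doc = np.zeros(dim)
    for t in tokens:
        if t not in index_vectors:  # index vectors are assigned lazily, per term
            index_vectors[t] = make_index_vector(dim, rng=rng)
        doc += index_vectors[t]
    return doc  # language-independent document vector
```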

Distributional Correspondence Indexing for Cross-Lingual and Cross-Domain Sentiment Classification

This paper presents the Distributional Correspondence Indexing (DCI) method for domain adaptation in sentiment classification, and shows that DCI obtains better performance than current state-of-the-art techniques for cross-lingual and cross-domain sentiment classification.
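A rough sketch of the indexing step, assuming cosine as the distributional correspondence function (one of several possible choices): each term is re-represented by its similarity to a small set of shared pivot terms.

```python
# Hedged sketch of Distributional Correspondence Indexing (DCI) with cosine
# as the correspondence function; the matrix layout is an assumption.
import numpy as np

def dci_embed_terms(X, pivot_ids):
    """X: (n_docs, n_terms) term-occurrence matrix; pivot_ids: pivot column indices.
    Returns (n_terms, n_pivots): each term described by its correspondence to pivots."""
    P = X[:, pivot_ids]
    Xn = X / (np.linalg.norm(X, axis=0, keepdims=True) + 1e-12)
    Pn = P / (np.linalg.norm(P, axis=0, keepdims=True) + 1e-12)
    return Xn.T @ Pn  # cosine(term profile, pivot profile)
```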

A Simple but Tough-to-Beat Baseline for Sentence Embeddings

GILE: A Generalized Input-Label Embedding for Text Classification

This paper proposes a new input-label model that generalizes over previous such models and addresses their limitations; it does not compromise performance on seen labels, and outperforms both monolingual and multilingual models that do not leverage label semantics and previous joint input-label space models in both scenarios.
...