Bi-Weighting Domain Adaptation for Cross-Language Text Classification

C. Q. Wan, Rong Pan, Jiefei Li
Text classification is widely used in many real-world applications. To obtain satisfactory classification performance, most traditional data mining methods require large amounts of labeled data, which can be costly in terms of both time and human effort. In reality, such resources are plentiful in English, since it has the largest user population on the Internet, but this is not true of many other languages. In this paper, we present a novel transfer learning approach to tackle the cross-language text…

Cross Language Text Classification via Subspace Co-regularized Multi-view Learning

A novel subspace co-regularized multi-view learning method built on parallel corpora produced by machine translation that jointly minimizes the training error of each classifier in each language while penalizing the distance between the subspace representations of parallel documents.

A Domain Adaptation Method for Text Classification based on Self-adjusted Training Approach

A self-adjusting training approach that adapts itself to the new distributions obtained during a self-training process. It obtains good results on the thematic cross-domain text classification task and reduces the error rate by 65.13% on average relative to the supervised learning approach on the testing dataset.

Triplex Transfer Learning: Exploiting Both Shared and Distinct Concepts for Text Classification

This work systematically analyzes the high-level concepts of transfer learning and proposes a general transfer learning framework based on nonnegative matrix tri-factorization, which allows both shared and distinct concepts to be explored across all the domains simultaneously.

Semi-Supervised Matrix Completion for Cross-Lingual Text Classification

The empirical results demonstrate the efficacy of the proposed approach, and show it outperforms a number of related cross-lingual learning methods.

Semi-Supervised Representation Learning for Cross-Lingual Text Classification

This paper proposes a new cross-lingual adaptation approach for document classification based on learning cross-lingual discriminative distributed representations of words to maximize the log-likelihood of the documents from both language domains under a cross-lingual log-bilinear document model, while minimizing the prediction log-losses of labeled documents.

Learning Latent Word Representations for Domain Adaptation using Supervised Word Clustering

This paper proposes a hierarchical multinomial Naive Bayes model with latent variables to conduct supervised word clustering on labeled documents from both source and target domains, and then uses the produced cluster distribution of each word as its latent feature representation for domain adaptation.

Distributional Correspondence Indexing for Cross-Lingual and Cross-Domain Sentiment Classification (Extended Abstract)

The experiments conducted show that DCI obtains better performance than current state-of-the-art techniques for cross-lingual and cross-domain sentiment classification.

A Novel Two-Step Method for Cross Language Representation Learning

This paper first formulates a matrix completion problem to produce a complete parallel document-term matrix for all documents in two languages, and then induces a low-dimensional cross-lingual document representation by applying latent semantic indexing to the obtained matrix.

C-BiLDA extracting cross-lingual topics from non-parallel texts by distinguishing shared from unshared content

A new bilingual probabilistic topic model called comparable bilingual latent Dirichlet allocation (C-BiLDA), which is able to deal with comparable (non-parallel) data and, unlike the standard bilingual LDA model, does not assume the availability of document pairs with identical topic distributions.



Extracting discriminative concepts for domain adaptation in text mining

This work proposes a domain adaptation method that parameterizes the concept space by a linear transformation under which it explicitly minimizes the distribution difference between the source domain, which has sufficient labeled data, and the target domains, which have only unlabeled data, while at the same time minimizing the empirical loss on the labeled data in the source domain.

An EM based training algorithm for cross-language text categorization

A learning algorithm based on the EM scheme that can be used to train text classifiers in a multilingual environment. Results show that the performance of the proposed method is very promising when applied to a test document set extracted from newsgroups in English and Italian.

Instance Weighting for Domain Adaptation in NLP

This paper formally analyzes and characterizes the domain adaptation problem from a distributional view, and shows that there are two distinct needs for adaptation, corresponding to the different distributions of instances and classification functions in the source and the target domains.

A Survey on Transfer Learning

The relationship between transfer learning and other related machine learning techniques, such as domain adaptation, multitask learning, sample selection bias, and covariate shift, is discussed.

Can Chinese web pages be classified with English data source?

This paper proposes an information bottleneck based approach to address the cross-language classification problem of Chinese and English Web pages, which significantly improves on several existing supervised and semi-supervised classifiers.

Domain Adaptation via Transfer Component Analysis

This work proposes a novel dimensionality reduction framework for reducing the distance between domains in a latent space for domain adaptation and proposes both unsupervised and semi-supervised feature extraction approaches, which can dramatically reduce the distance between domain distributions by projecting data onto the learned transfer components.

Boosting for transfer learning

This paper presents a novel transfer learning framework called TrAdaBoost, which extends boosting-based learning algorithms and shows that this method can allow us to learn an accurate model using only a tiny amount of new data and a large amount of old data, even when the new data are not sufficient to train a model alone.

Self-taught learning: transfer learning from unlabeled data

An approach to self-taught learning that uses sparse coding to construct higher-level features using the unlabeled data to form a succinct input representation and significantly improve classification performance.

Transductive Inference for Text Classification using Support Vector Machines

An analysis of why Transductive Support Vector Machines are well suited for text classification is presented, and an algorithm for training TSVMs that handles 10,000 examples and more is proposed.

Cross-Lingual Text Categorization

Practical and cost-effective solutions for automatic Cross-Lingual Text Categorization are described, both for the case where a sufficient number of training examples is available for each new language and for the case where no training examples are available for some language.