Learning Word Embeddings for Low-Resource Languages by PU Learning

@inproceedings{Jiang2018LearningWE,
  title={Learning Word Embeddings for Low-Resource Languages by PU Learning},
  author={Chao Jiang and Hsiang-Fu Yu and Cho-Jui Hsieh and Kai-Wei Chang},
  booktitle={NAACL},
  year={2018}
}
Word embedding is a key component in many downstream applications in processing natural languages. Existing approaches often assume the existence of a large collection of text for learning effective word embedding. However, such a corpus may not be available for some low-resource languages. In this paper, we study how to effectively learn a word embedding model on a corpus with only a few million tokens. In such a situation, the co-occurrence matrix is sparse as the co-occurrences of many word… 
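The core idea sketched in the abstract is to treat the zero cells of the co-occurrence matrix as unlabeled rather than negative examples. The following Python snippet is a rough illustration only, not the authors' algorithm: the function names, hyperparameters, and the simple full-batch gradient descent are assumptions made for this sketch. It factorizes a PPMI matrix while giving unobserved (zero) cells a small, nonzero weight instead of ignoring them.

import numpy as np

def ppmi(counts):
    # Positive PMI from a dense co-occurrence count matrix.
    total = counts.sum()
    row = counts.sum(axis=1, keepdims=True)
    col = counts.sum(axis=0, keepdims=True)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(counts * total / (row * col))
    pmi[~np.isfinite(pmi)] = 0.0
    return np.maximum(pmi, 0.0)

def pu_factorize(M, dim=50, rho=0.1, lam=1e-2, lr=0.05, epochs=200, seed=0):
    # Factorize M ~ W @ C.T; observed cells get weight 1.0, unobserved cells
    # get a small weight rho (the PU-style treatment of zero entries).
    rng = np.random.default_rng(seed)
    n, m = M.shape
    W = 0.1 * rng.standard_normal((n, dim))
    C = 0.1 * rng.standard_normal((m, dim))
    weight = np.where(M > 0, 1.0, rho)
    for _ in range(epochs):
        R = weight * (W @ C.T - M)      # weighted residual
        grad_W = R @ C + lam * W        # gradient of squared loss + L2 term
        grad_C = R.T @ W + lam * C
        W -= lr * grad_W
        C -= lr * grad_C
    return W, C

if __name__ == "__main__":
    counts = np.random.default_rng(1).poisson(0.3, size=(200, 200)).astype(float)
    W, _ = pu_factorize(ppmi(counts), dim=20)
    print("word embedding matrix shape:", W.shape)

In practice the co-occurrence matrix would stay sparse and the weighted factorization would be solved with a scalable solver rather than dense full-batch updates; the snippet only shows where the PU-style weighting of unobserved entries enters.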

Citations

Evaluating Word Embeddings on Low-Resource Languages
TLDR
This paper argues that the analogy task is unsuitable for low-resource languages for two reasons: it requires that word embeddings be trained on large amounts of text, and analogies may not be well-defined in some low-resource settings. It introduces the OddOneOut and Topk tasks, which are specifically designed for model selection in the low-resource setting.
Dirichlet-Smoothed Word Embeddings for Low-Resource Settings
TLDR
This paper revisits PPMI by adding Dirichlet smoothing to correct its bias towards rare words and finds the proposed method outperforms PU-Learning for low-resource settings and obtains competitive results for Maltese and Luxembourgish.
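As a hedged sketch of the general idea in the entry above (the cited paper's exact smoothing scheme may differ, and the function name and alpha value here are assumptions), Dirichlet/add-alpha pseudo-counts can be added to the co-occurrence counts before computing PPMI, which dampens the inflated PMI scores of rare words:

import numpy as np

def smoothed_ppmi(counts, alpha=0.1):
    # Add a pseudo-count alpha to every cell before normalizing.
    smoothed = counts + alpha
    p_xy = smoothed / smoothed.sum()
    p_x = p_xy.sum(axis=1, keepdims=True)
    p_y = p_xy.sum(axis=0, keepdims=True)
    return np.maximum(np.log(p_xy / (p_x * p_y)), 0.0)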
Massive vs. Curated Embeddings for Low-Resourced Languages: the Case of Yorùbá and Twi
TLDR
This paper focuses on two African languages, Yorùbá and Twi, and uses different architectures that learn word representations both from surface forms and characters to further exploit all the available information, which proved to be important for these languages.
Punctuation and Parallel Corpus Based Word Embedding Model for Low-Resource Languages
TLDR
This work proposes a punctuation and parallel corpus based word embedding model that generates the global word-pair co-occurrence matrix with a punctuation-based distance attenuation function, and integrates it with the intermediate word vectors generated from a small-scale bilingual parallel corpus to train word embeddings.
Cross-Lingual Alignment of Contextual Word Embeddings, with Applications to Zero-shot Dependency Parsing
TLDR
A novel method for multilingual transfer that utilizes deep contextual embeddings, pretrained in an unsupervised fashion, that consistently outperforms the previous state-of-the-art on 6 tested languages, yielding an improvement of 6.8 LAS points on average.
Exploiting Cross-Lingual Representations For Natural Language Processing
TLDR
It is argued that cross-lingual representations are an effective means of extending NLP tools to languages beyond English without resorting to generous amounts of annotated data or expensive machine translation.
PAUSE: Positive and Annealed Unlabeled Sentence Embedding
TLDR
PAUSE (Positive and Annealed Unlabeled Sentence Embedding), capable of learning high-quality sentence embeddings from a partially labeled dataset, is proposed and experimentally shows that PAUSE achieves, and sometimes surpasses, state-of-the-art results using only a small fraction of labeled sentence pairs on various benchmark tasks.
Multilingual Sentiment Analysis
TLDR
This chapter focuses on sentiment analysis of various low resource languages having limited sentiment analysis resources such as annotated datasets, word embeddings and sentiment lexicons, along with English.
Low-Resource Generation of Multi-hop Reasoning Questions
TLDR
This paper first builds a multi-hop generation model and guides it to satisfy logical rationality using a reasoning chain extracted from a given text, then applies it to the task of machine reading comprehension and achieves significant performance improvements.
...

References

Showing 1-10 of 35 references
GloVe: Global Vectors for Word Representation
TLDR
A new global log-bilinear regression model that combines the advantages of the two major model families in the literature (global matrix factorization and local context window methods) and produces a vector space with meaningful substructure.
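For reference, the weighted least-squares objective summarized above is

J = \sum_{i,j=1}^{V} f(X_{ij}) \, \big( w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \big)^2,

where X_{ij} is the co-occurrence count of words i and j and f is a weighting function that down-weights rare co-occurrences and caps very frequent ones.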
Improving Document Ranking with Dual Word Embeddings
TLDR
This paper investigates the popular neural word embedding method Word2vec as a source of evidence in document ranking and proposes the Dual Embedding Space Model (DESM), which provides evidence that a document is about a query term.
Revisiting Embedding Features for Simple Semi-supervised Learning
TLDR
Experiments on the task of named entity recognition show that each of the proposed approaches can better utilize the word embedding features, among which the distributional prototype approach performs the best.
Distributed Representations of Words and Phrases and their Compositionality
TLDR
This paper presents a simple method for finding phrases in text, and shows that learning good vector representations for millions of phrases is possible and describes a simple alternative to the hierarchical softmax called negative sampling.
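The negative sampling objective mentioned above replaces the full softmax: for a word w with observed context c and k negative contexts c_i drawn from a noise distribution P_n(w), it maximizes

\log \sigma\big(v'^{\top}_{c} v_w\big) + \sum_{i=1}^{k} \mathbb{E}_{c_i \sim P_n(w)} \big[ \log \sigma\big(-v'^{\top}_{c_i} v_w\big) \big],

where v and v' are the input and output embedding vectors and \sigma is the logistic function.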
Improving Distributional Similarity with Lessons Learned from Word Embeddings
TLDR
It is revealed that much of the performance gains of word embeddings are due to certain system design choices and hyperparameter optimizations, rather than the embedding algorithms themselves, and these modifications can be transferred to traditional distributional models, yielding similar gains.
Model-based Word Embeddings from Decompositions of Count Matrices
This work develops a new statistical understanding of word embeddings induced from transformed count data, using the class of hidden Markov models (HMMs) underlying Brown clustering as a generative model…
A word at a time: computing word relatedness using temporal semantic analysis
TLDR
This paper proposes a new semantic relatedness model, Temporal Semantic Analysis (TSA), which captures temporal information in word semantics as a vector of concepts over a corpus of temporally-ordered documents.
Using Wiktionary for Computing Semantic Relatedness
TLDR
It is shown that Wiktionary is the best lexical semantic resource in the ranking task and performs comparably to other resources in the word choice task, and the concept vector based approach yields the best results on all datasets in both evaluations.
Deep Recursive Neural Networks for Compositionality in Language
TLDR
The results show that deep RNNs outperform associated shallow counterparts that employ the same number of parameters, and outperform previous baselines on the sentiment analysis task, including a multiplicative RNN variant as well as the recently introduced paragraph vectors.
Polarity Inducing Latent Semantic Analysis
TLDR
The key contribution of this work is to show how to assign signs to the entries in the co-occurrence matrix on which LSA operates, so as to induce a subspace with the desired property.
...