Learning the Curriculum with Bayesian Optimization for Task-Specific Word Representation Learning

Yulia Tsvetkov, Manaal Faruqui, Wang Ling, Brian MacWhinney, Chris Dyer
We use Bayesian optimization to learn curricula for word representation learning, optimizing performance on downstream tasks that depend on the learned representations as features. The curricula are modeled by a linear ranking function: the scalar product of a learned weight vector and an engineered feature vector that characterizes different aspects of the complexity of each instance in the training corpus. We show that learning the curriculum improves performance on a variety of…
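The linear ranking function described in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration, not the authors' setup: the two complexity features (token count and mean token length), the hand-picked weight values, and the choice to sort in ascending score order. In the paper the weight vector is learned with Bayesian optimization against downstream task performance; this sketch only shows how a fixed weight vector turns feature vectors into an ordering of the corpus.

```python
from typing import Callable, List, Sequence


def featurize(sentence: str) -> List[float]:
    """Hypothetical complexity features: token count and mean token length."""
    tokens = sentence.split()
    if not tokens:
        return [0.0, 0.0]
    mean_len = sum(len(t) for t in tokens) / len(tokens)
    return [float(len(tokens)), mean_len]


def score(weights: Sequence[float], features: Sequence[float]) -> float:
    """Scalar product of the weight vector and the feature vector."""
    return sum(w * f for w, f in zip(weights, features))


def order_corpus(
    corpus: Sequence[str],
    weights: Sequence[float],
    featurize_fn: Callable[[str], List[float]] = featurize,
) -> List[str]:
    """Rank training instances by score (ascending order is an assumption)."""
    return sorted(corpus, key=lambda s: score(weights, featurize_fn(s)))


corpus = [
    "the cat sat",
    "representation learning benefits from carefully ordered training data",
    "a dog",
]
weights = [1.0, 0.5]  # illustrative values; the paper learns these
curriculum = order_corpus(corpus, weights)
for sentence in curriculum:
    print(sentence)
```

An outer optimization loop would propose a weight vector, train representations on the resulting ordering, measure downstream performance, and use that measurement to choose the next weight vector.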

Learning to select data for transfer learning with Bayesian Optimization

This work proposes to learn data selection measures using Bayesian Optimization and evaluates them across models, domains and tasks, showing the importance of complementing similarity with diversity, and that learned measures are, to some degree, transferable across models, domains, and even tasks.

Reinforcement Learning based Curriculum Optimization for Neural Machine Translation

This work uses reinforcement learning to learn a curriculum automatically, jointly with the NMT system, in the course of a single training run, and shows that this approach can beat uniform baselines on Paracrawl and WMT English-to-French datasets.

Learning a Multitask Curriculum for Neural Machine Translation

A method to learn a multitask curriculum on a single, diverse, potentially noisy training dataset, which computes multiple data selection scores for each training example, each score measuring how useful the example is to a certain task.

An Empirical Exploration of Curriculum Learning for Neural Machine Translation

A probabilistic view of curriculum learning is adopted, which lets us flexibly evaluate the impact of curricula design, and an extensive exploration on a German-English translation task shows it is possible to improve convergence time at no loss in translation quality.

To Batch or Not to Batch? Comparing Batching and Curriculum Learning Strategies across Tasks and Datasets

This work presents a systematic analysis of different curriculum learning strategies and different batching strategies, considering multiple datasets for three tasks: text classification, sentence and phrase similarity, and part-of-speech tagging.

On the Role of Corpus Ordering in Language Modeling

Empirical results from training transformer language models on an English corpus, evaluated intrinsically as well as after fine-tuning on eight tasks from the GLUE benchmark, show consistent gains over conventional vanilla training.

Learning a Multi-Domain Curriculum for Neural Machine Translation

This work performs data selection for multiple domains at once by carefully introducing instance-level domain-relevance features and automatically constructing a training curriculum to gradually concentrate on multi-domain relevant and noise-reduced data batches.

Tree-Structured Curriculum Learning Based on Semantic Similarity of Text

A new curriculum learning (CL) method is proposed that uses semantic dissimilarity as the complexity measure and a tree-structured curriculum as the organization method; in an experiment on a sentiment analysis task, it performs better than previous CL methods.

Learning word representations in a developmentally realistic order

It is shown that word representations learned in a more natural order differ in some respects from those learned in the usual all-at-once, shuffled fashion, including greater stability, and that how much word representations change during training depends on when they are introduced.

Understanding Learning Dynamics Of Language Models with SVCCA

The first study of the learning dynamics of neural language models is presented, using a simple and flexible analysis method called Singular Vector Canonical Correlation Analysis (SVCCA), which makes it possible to compare learned representations across time and across models without evaluating directly on annotated data.



Efficient Estimation of Word Representations in Vector Space

Two novel model architectures for computing continuous vector representations of words from very large data sets are proposed and it is shown that these vectors provide state-of-the-art performance on the authors' test set for measuring syntactic and semantic word similarities.

Bayesian Optimization of Text Representations

This work applies a sequential model-based optimization technique and shows that this method makes standard linear models competitive with more sophisticated, expensive state-of-the-art methods based on latent variable models or neural networks on various topic classification and sentiment analysis problems.

Curriculum learning

It is hypothesized that curriculum learning has both an effect on the speed of convergence of the training process to a minimum and on the quality of the local minima obtained: curriculum learning can be seen as a particular form of continuation method (a general strategy for global optimization of non-convex functions).

Syntactic Processing Using the Generalized Perceptron and Beam Search

It is argued that the conceptual and computational simplicity of the framework, together with its language-independent nature, make it a competitive choice for a range of syntactic processing tasks and one that should be considered for comparison by developers of alternative approaches.

Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank

This work introduces a Sentiment Treebank with fine-grained sentiment labels for 215,154 phrases in the parse trees of 11,855 sentences, which presents new challenges for sentiment compositionality, and introduces the Recursive Neural Tensor Network.

A Systematic Exploration of Diversity in Machine Translation

It is found that diversity can improve performance on these tasks, especially for sentences that are difficult for MT.

Combining Lexical and Grammatical Features to Improve Readability Measures for First and Second Language Texts

This work evaluates a system that uses interpolated predictions of reading difficulty based on both vocabulary and grammatical features, and indicates that grammatical features may play a more important role in second language readability than in first language readability.

Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms

Experimental results on part-of-speech tagging and base noun phrase chunking are given, in both cases showing improvements over results for a maximum-entropy tagger.

Self-Paced Learning for Latent Variable Models

A novel, iterative self-paced learning algorithm in which each iteration simultaneously selects easy samples and learns a new parameter vector; it outperforms the state-of-the-art method for learning a latent structural SVM on four applications.

Revisiting Readability: A Unified Framework for Predicting Text Quality

This study combines lexical, syntactic, and discourse features to produce a highly predictive model of human readers' judgments of text readability and demonstrates that discourse relations are strongly associated with the perceived quality of text.