Towards Zero-shot Language Modeling

E. Ponti, Ivan Vulic, Ryan Cotterell, Roi Reichart, Anna Korhonen
Can we construct a neural language model which is inductively biased towards learning human language? Motivated by this question, we aim at constructing an informative prior for held-out languages on the task of character-level, open-vocabulary language modelling. We obtain this prior as the posterior over network weights conditioned on the data from a sample of training languages, which is approximated through Laplace’s method. Based on a large and diverse sample of languages, the use of our… 
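The idea of reusing a posterior as a prior can be illustrated on a toy problem. The sketch below is not the paper's model: it assumes a linear-Gaussian regression in place of a neural language model, for which the Laplace approximation (a Gaussian centred at the posterior mode, with precision given by the Hessian of the negative log-posterior) happens to be exact. All names (`laplace_posterior`, `X_train`, `X_new`, and so on) are hypothetical and chosen for illustration only.

```python
import numpy as np


def laplace_posterior(X, y, prior_mean, prior_precision, noise_var=1.0):
    """Return (mean, precision) of a Gaussian approximation to the posterior.

    Assumes a linear-Gaussian model y = X @ w + noise. The negative
    log-posterior is then quadratic in w, so its mode and Hessian have a
    closed form. For a neural network one would instead find the mode by
    gradient descent and approximate the Hessian.
    """
    # Hessian of the negative log-posterior (constant for this model).
    precision = prior_precision + X.T @ X / noise_var
    # Posterior mode (MAP estimate of the weights).
    rhs = prior_precision @ prior_mean + X.T @ y / noise_var
    mean = np.linalg.solve(precision, rhs)
    return mean, precision


rng = np.random.default_rng(0)
dim = 3
true_w = np.array([1.0, -2.0, 0.5])

# Toy stand-in for the data from the sample of training languages.
X_train = rng.normal(size=(200, dim))
y_train = X_train @ true_w + rng.normal(size=200)

# Toy stand-in for a handful of examples from a held-out language.
X_new = rng.normal(size=(5, dim))
y_new = X_new @ true_w + rng.normal(size=5)

# Step 1: start from a weak prior and condition on the training data.
w0 = np.zeros(dim)
P0 = 1e-2 * np.eye(dim)
prior_mean, prior_precision = laplace_posterior(X_train, y_train, w0, P0)

# Zero-shot: predict on the held-out data using the prior mean alone.
zero_shot_pred = X_new @ prior_mean

# Few-shot: use the training posterior as the prior for the held-out data.
informed_mean, informed_precision = laplace_posterior(
    X_new, y_new, prior_mean, prior_precision
)

# Baseline: the same few examples with only the weak prior.
uninformed_mean, _ = laplace_posterior(X_new, y_new, w0, P0)
```

In this toy setting, conditioning on the held-out data with the informed prior gives the same result as conditioning on all the data at once, which is what makes the posterior a principled choice of prior. With a neural network the Gaussian is only an approximation, so this equivalence would hold only approximately.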


Parameter Space Factorization for Zero-Shot Learning across Tasks and Languages
A Bayesian generative model for the space of neural parameters is proposed that can be factorized into latent variables for each language and each task; the posteriors over these latent variables are inferred from data from seen task–language combinations through variational inference.
Emergent Communication Pretraining for Few-Shot Machine Translation
It is shown that grounding communication on images—as a crude approximation of real-world environments—inductively biases the model towards learning natural languages, and the potential of emergent communication pretraining for both natural language processing tasks in resource-poor settings and extrinsic evaluation of artificial languages is revealed.
XCOPA: A Multilingual Dataset for Causal Commonsense Reasoning
This work introduces Cross-lingual Choice of Plausible Alternatives (XCOPA), a typologically diverse multilingual dataset for causal commonsense reasoning in 11 languages, revealing that current methods based on multilingual pretraining and zero-shot fine-tuning transfer suffer from the curse of multilinguality and fall short of performance in monolingual settings by a large margin.
MAD-G: Multilingual Adapter Generation for Efficient Cross-Lingual Transfer
MAD-G (Multilingual ADapter Generation), which contextually generates language adapters from language representations based on typological features, offers substantial benefits for low-resource languages, particularly on the NER task in low-resource African languages.
Minimax and Neyman–Pearson Meta-Learning for Outlier Languages
Two variants of MAML are created based on alternative criteria: Minimax MAML reduces the maximum risk across languages, while Neyman–Pearson MAML constrains the risk in each language to a maximum threshold; both criteria constitute fully differentiable two-player games.
Universal linguistic inductive biases via meta-learning
This work introduces a framework for giving particular linguistic inductive biases to a neural network model; such a model can then be used to empirically explore the effects of those inductive biases.
Pretrained Transformers as Universal Computation Engines
It is found that pretraining on natural language improves performance and compute efficiency on non-language downstream tasks and enables a Frozen Pretrained Transformer (FPT) to generalize zero-shot to these modalities, matching the performance of a transformer fully trained on these tasks.
SIGTYP 2020 Shared Task: Prediction of Typological Features
It is revealed that even the strongest submitted systems struggle with predicting feature values for languages where few features are known, and that the most successful methods make use of correlations between features.
Multi-SimLex: A Large-Scale Evaluation of Multilingual and Crosslingual Lexical Semantic Similarity
The Multi-SimLex data sets, their creation protocol, strong baseline results, and in-depth analyses, which can help guide future developments in multilingual lexical semantics and representation learning, are made available via a website that will encourage community effort in further expansion of Multi-SimLex.


What Kind of Language Is Hard to Language-Model?
A new paired-sample multiplicative mixed-effects model is introduced to obtain language difficulty coefficients from at-least-pairwise parallel corpora and it is shown that “translationese” is not any easier to model than natively written language in a fair comparison.
Contextual Parameter Generation for Universal Neural Machine Translation
This approach requires no changes to the model architecture of a standard NMT system, but instead introduces a new component, the contextual parameter generator (CPG), that generates the parameters of the system (e.g., weights in a neural network).
Regularizing and Optimizing LSTM Language Models
This paper proposes the weight-dropped LSTM which uses DropConnect on hidden-to-hidden weights as a form of recurrent regularization and introduces NT-ASGD, a variant of the averaged stochastic gradient method, wherein the averaging trigger is determined using a non-monotonic condition as opposed to being tuned by the user.
Sequence to Sequence Learning with Neural Networks
This paper presents a general end-to-end approach to sequence learning that makes minimal assumptions on the sequence structure, and finds that reversing the order of the words in all source sentences improved the LSTM's performance markedly, because doing so introduced many short term dependencies between the source and the target sentence which made the optimization problem easier.
Matching Networks for One Shot Learning
This work employs ideas from metric learning based on deep neural features and from recent advances that augment neural networks with external memories to learn a network that maps a small labelled support set and an unlabelled example to its label, obviating the need for fine-tuning to adapt to new class types.
Language Modeling for Morphologically Rich Languages: Character-Aware Modeling for Word-Level Prediction
The main technical contribution of this work is a novel method for injecting subword-level information into semantic word vectors, integrated into the neural language modeling training, to facilitate word-level prediction.
Cross-Lingual Word Embeddings for Low-Resource Language Modeling
This work investigates the use of bilingual lexicons to improve language models when textual training data is limited to as few as a thousand sentences, and involves learning cross-lingual word embeddings as a preliminary step in training monolingual language models.
An Analysis of Neural Language Modeling at Multiple Scales
This work takes existing state-of-the-art word-level language models based on LSTMs and QRNNs and extends them to both larger vocabularies and character-level granularity, achieving state-of-the-art results on character-level and word-level datasets.
Are All Languages Equally Hard to Language-Model?
This work develops an evaluation framework for fair cross-linguistic comparison of language models, using translated text so that all models are asked to predict approximately the same information.
Assessing the Ability of LSTMs to Learn Syntax-Sensitive Dependencies
It is concluded that LSTMs can capture a non-trivial amount of grammatical structure given targeted supervision, but stronger architectures may be required to further reduce errors; furthermore, the language modeling signal is insufficient for capturing syntax-sensitive dependencies, and should be supplemented with more direct supervision if such dependencies need to be captured.