Corpus ID: 222291168

Gradient Vaccine: Investigating and Improving Multi-task Optimization in Massively Multilingual Models

Zirui Wang, Yulia Tsvetkov, Orhan Firat, Yuan Cao
Massively multilingual models subsuming tens or even hundreds of languages pose great challenges to multi-task optimization. While it is a common practice to apply a language-agnostic procedure optimizing a joint multilingual task objective, how to properly characterize and take advantage of its underlying problem structure for improving optimization efficiency remains under-explored. In this paper, we attempt to peek into the black-box of multilingual optimization through the lens of loss… 
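The diagnostic angle the abstract alludes to can be illustrated with a simple probe: measuring cosine similarity between per-language gradients to see whether their updates align or conflict. A minimal NumPy sketch of that probe (the function name and inputs are illustrative, not taken from the paper):

```python
import numpy as np

def grad_cosine(g_a, g_b):
    """Cosine similarity between two tasks' flattened gradients:
    positive values mean the updates roughly align, negative values
    mean the tasks pull the shared parameters in conflicting directions."""
    g_a, g_b = np.asarray(g_a, dtype=float), np.asarray(g_b, dtype=float)
    return float(np.dot(g_a, g_b) / (np.linalg.norm(g_a) * np.linalg.norm(g_b)))

# Orthogonal gradients: neither aligned nor conflicting.
print(grad_cosine([1.0, 0.0], [0.0, 1.0]))  # → 0.0
```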

Informative Language Representation Learning for Massively Multilingual Neural Machine Translation

Two methods, language embedding embodiment and language-aware multi-head attention, are proposed to learn informative language representations that guide translation in the right directions; linguistic typology prediction experiments show that the matrix-based language representations learned by these methods capture rich linguistic typology features.

Sequential Reptile: Inter-Task Gradient Alignment for Multilingual Learning

This paper proposes a simple yet effective method that efficiently aligns gradients between tasks, and extensively validates it on various multi-task learning and zero-shot cross-lingual transfer tasks, where it largely outperforms all relevant baselines.

Multi-Task Learning in Natural Language Processing: An Overview

An overview of the use of MTL in NLP tasks is given and optimization techniques on loss construction, data sampling, and task scheduling to properly train a multi-task model are presented.

FairRoad: Achieving Fairness for Recommender Systems with Optimized Antidote Data

This paper proposes a new approach, fair recommendation with optimized antidote data (FairRoad), which aims to improve the fairness performance of recommender systems through the construction of a small, carefully crafted antidote dataset.

Improving Multi-Task Generalization via Regularizing Spurious Correlation

It is theoretically and empirically shown that MTL is more prone than single-task learning to absorbing non-causal knowledge from other tasks, and thus generalizes worse; the paper proposes Multi-Task Causal Representation Learning (MT-CRL), which represents multi-task knowledge via disentangled neural modules and learns which module is causally related to each task via an MTL-specific invariant regularization.

Structured Multi-task Learning for Molecular Property Prediction

A method called SGNN-EBM is proposed to systematically investigate structured task modeling from two perspectives; it can be efficiently trained through a noise-contrastive estimation (NCE) approach.

Demystify Optimization Challenges in Multilingual Transformers

A principled multi-objective optimization algorithm, Curvature-Aware Task Scaling (CATS), is proposed, which improves both optimization and generalization, especially for low-resource languages.

Do Current Multi-Task Optimization Methods in Deep Learning Even Help?

Recent research has proposed a series of specialized optimization algorithms for deep multi-task models; this work scrutinizes whether these multi-task optimization (MTO) methods actually yield the improved solutions that are often claimed for them.

Personalizing Intervened Network for Long-tailed Sequential User Behavior Modeling

A novel Gradient Aggregation technique is proposed that learns common knowledge shared by all users into a backbone model, followed by separate plugin prediction networks that personalize predictions for head users and tail users.

Dynamic Restrained Uncertainty Weighting Loss for Multitask Learning of Vocal Expression

We propose a novel Dynamic Restrained Uncertainty Weighting Loss to handle the problem of balancing the contributions of multiple tasks in the ICML ExVo 2022 Challenge.

Gradient Surgery for Multi-Task Learning

This work identifies a set of three conditions of the multi-task optimization landscape that cause detrimental gradient interference, and develops a simple yet general approach for avoiding such interference between task gradients.
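The core operation behind "avoiding such interference between task gradients" can be sketched as a projection step: when two task gradients conflict (negative dot product), the interfering component of one is removed. A minimal NumPy illustration of that idea, not the authors' full algorithm:

```python
import numpy as np

def deconflict(g_i, g_j):
    """If g_i conflicts with g_j (negative dot product), project g_i
    onto the normal plane of g_j, removing the interfering component."""
    g_i, g_j = np.asarray(g_i, dtype=float), np.asarray(g_j, dtype=float)
    dot = np.dot(g_i, g_j)
    if dot < 0.0:
        g_i = g_i - (dot / np.dot(g_j, g_j)) * g_j
    return g_i

# [1, 0] conflicts with [-1, 1]; after projection the conflict is gone.
print(deconflict([1.0, 0.0], [-1.0, 1.0]))  # → [0.5 0.5]
```

After the projection, the adjusted gradient is orthogonal to the gradient it conflicted with, so applying it no longer increases that task's loss to first order.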

Attention is All you Need

A new simple network architecture, the Transformer, based solely on attention mechanisms and dispensing with recurrence and convolutions entirely, is proposed; it generalizes well to other tasks, as demonstrated by applying it successfully to English constituency parsing with both large and limited training data.

XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalization

The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is introduced: a multi-task benchmark for evaluating the cross-lingual generalization capabilities of multilingual representations across 40 languages and 9 tasks.

Massively Multilingual Neural Machine Translation in the Wild: Findings and Challenges

This work sets a milestone by building a single massively multilingual NMT model handling 103 languages trained on over 25 billion examples, and demonstrates effective transfer learning ability, significantly improving translation quality of low-resource languages, while keeping high-resource language translation quality on-par with competitive bilingual baselines.

How Multilingual is Multilingual BERT?

It is concluded that M-BERT does create multilingual representations, but that these representations exhibit systematic deficiencies affecting certain language pairs, and that the model can find translation pairs.

Multi-Task Learning as Multi-Objective Optimization

This paper proposes an upper bound for the multi-objective loss and shows that it can be optimized efficiently, and proves that optimizing this upper bound yields a Pareto optimal solution under realistic assumptions.
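For the two-task case, the multi-objective view has a well-known closed form: the minimum-norm convex combination of the two task gradients. A small sketch of that standard formula (variable names are ours, and this is the textbook two-task special case, not the paper's general procedure):

```python
import numpy as np

def min_norm_combo(g1, g2):
    """Return the weight alpha in [0, 1] minimizing ||alpha*g1 + (1-alpha)*g2||,
    the standard two-task minimum-norm solution from multi-objective MTL."""
    g1, g2 = np.asarray(g1, dtype=float), np.asarray(g2, dtype=float)
    diff = g1 - g2
    alpha = np.dot(g2, g2 - g1) / np.dot(diff, diff)
    return float(np.clip(alpha, 0.0, 1.0))

# Symmetric orthogonal gradients are weighted equally.
print(min_norm_combo([1.0, 0.0], [0.0, 1.0]))  # → 0.5
```

When one gradient dominates the other (e.g. they point the same way but differ in length), the clipping pushes alpha to 0 or 1 and the shorter gradient is used alone, which is exactly the minimum-norm point of the convex hull.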

GradNorm: Gradient Normalization for Adaptive Loss Balancing in Deep Multitask Networks

A gradient normalization (GradNorm) algorithm that automatically balances training in deep multitask models by dynamically tuning gradient magnitudes is presented, showing that for various network architectures, for both regression and classification tasks, and on both synthetic and real datasets, GradNorm improves accuracy and reduces overfitting across multiple tasks.
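The balancing idea, dynamically tuning gradient magnitudes, can be sketched in simplified form: each task's gradient norm is pulled toward the mean norm scaled by its relative inverse training rate. An illustrative version of the target computation only (the full algorithm also takes gradient steps on the loss weights themselves):

```python
import numpy as np

def gradnorm_targets(grad_norms, loss_ratios, alpha=1.5):
    """Target gradient norms per task: tasks that are training slowly
    (high loss ratio L_i(t)/L_i(0)) get a larger target, boosting them."""
    grad_norms = np.asarray(grad_norms, dtype=float)
    loss_ratios = np.asarray(loss_ratios, dtype=float)
    r = loss_ratios / loss_ratios.mean()  # relative inverse training rate
    return grad_norms.mean() * r ** alpha

# Equal training rates: every task is pulled toward the mean gradient norm.
print(gradnorm_targets([1.0, 3.0], [1.0, 1.0]))  # → [2. 2.]
```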

Google’s Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation

This work proposes a simple solution to use a single Neural Machine Translation (NMT) model to translate between multiple languages using a shared wordpiece vocabulary, and introduces an artificial token at the beginning of the input sentence to specify the required target language.

On Negative Interference in Multilingual Models: Findings and A Meta-Learning Treatment

The results show that negative interference is more common than previously known, suggesting new directions for improving multilingual representations; a meta-learning algorithm is presented that obtains better cross-lingual transferability and alleviates negative interference.