Corpus ID: 232269775

Modeling the Second Player in Distributionally Robust Optimization

Paul Michel, Tatsunori B. Hashimoto, Graham Neubig
Distributionally robust optimization (DRO) provides a framework for training machine learning models that perform well on a collection of related data distributions (the “uncertainty set”). This is done by solving a min-max game: the model is trained to minimize its maximum expected loss over all distributions in the uncertainty set. While careful design of the uncertainty set is critical to the success of the DRO procedure, previous work has been limited to relatively simple…
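The min-max game described above can be sketched for a finite uncertainty set (a mixture over a handful of groups or domains): the adversary picks mixture weights that maximize the model's expected loss. A minimal illustration of that inner maximization, using exponentiated-gradient ascent on the simplex (function name and hyperparameters are my own, not from the paper):

```python
import numpy as np

def worst_case_loss(group_losses, step=1.0, iters=50):
    """Inner maximization of the DRO min-max game over a finite
    uncertainty set: find mixture weights q over the groups that
    maximize the expected loss, via exponentiated-gradient ascent."""
    losses = np.asarray(group_losses, dtype=float)
    q = np.ones_like(losses) / len(losses)  # start from the uniform mixture
    for _ in range(iters):
        q = q * np.exp(step * losses)       # multiplicative ascent step
        q = q / q.sum()                     # renormalize onto the simplex
    return float(q @ losses), q

# The adversary concentrates its weight on the hardest group,
# so the robust objective approaches the worst group loss.
loss, q = worst_case_loss([0.2, 1.5, 0.7])
```

The outer player (the model) would then take a gradient step on this weighted loss, alternating with the adversary.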


Distributionally Robust Models with Parametric Likelihood Ratios
Three simple ideas – mini-batch level normalization, a KL penalty, and simultaneous gradient updates – allow models to be trained with DRO using a broader class of parametric likelihood ratios.
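Two of the ideas listed above can be sketched concretely: reweight per-example losses by a parametric likelihood ratio, normalize the weights at the mini-batch level, and penalize the adversary's KL divergence from the training distribution. The function below is an assumed illustration of that objective, not the paper's implementation:

```python
import numpy as np

def dro_batch_objective(losses, log_ratios, kl_coef=1.0):
    """Adversary's objective on one mini-batch: likelihood-ratio
    reweighted loss with batch-level normalization and a KL penalty
    keeping the adversarial distribution close to the data."""
    w = np.exp(np.asarray(log_ratios, dtype=float))
    w = w / w.mean()                    # mini-batch level normalization: mean weight 1
    weighted = float((w * np.asarray(losses, dtype=float)).mean())
    kl = float((w * np.log(w)).mean())  # batch estimate of KL(q || p)
    return weighted - kl_coef * kl, w
```

With uniform log-ratios the weights collapse to 1 and the objective reduces to the ordinary mean loss; the adversary raises it by upweighting high-loss examples, at a KL cost.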
A Tale of Two Models: Constructing Evasive Attacks on Edge Models
This paper introduces a new evasive attack, DIVA, that exploits differences in edge adaptation, by adding adversarial noise to input data that maximizes the output difference between the original and adapted model.
Examining and Combating Spurious Features under Distribution Shift
This paper defines and analyzes robust and spurious representations using the information-theoretic concept of minimal sufficient statistics, and proves that even when there is only bias of the input distribution, models can still pick up spurious features from their training data.
FedAug: Reducing the Local Learning Bias Improves Federated Learning on Heterogeneous Data
It is shown that FedAug consistently outperforms other SOTA federated learning (FL) and domain generalization (DG) baselines, that both of its components (i.e., AugMean and AugCA) yield individual performance gains, and that the DG algorithms help to enhance domain robustness.
Distributionally Robust Finetuning BERT for Covariate Drift in Spoken Language Understanding
This study investigates robustness against covariate drift in spoken language understanding (SLU) and applies distributionally robust optimization (DRO) to finetuning BERT-based models to mitigate the performance loss.
Boosted CVaR Classification
The Boosted CVaR Classification framework is proposed, motivated by a direct relationship between CVaR and a classical boosting algorithm called LPBoost; an algorithm called α-AdaLPBoost is designed which achieves higher tail performance than deterministic model training methods.
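The CVaR objective behind this line of work is simply the mean loss over the worst α-fraction of examples. A minimal sketch (the function name is mine):

```python
import numpy as np

def cvar_loss(losses, alpha=0.25):
    """Conditional value at risk of per-example losses: the mean of
    the worst alpha-fraction, i.e. the tail objective optimized by
    CVaR-based DRO methods."""
    losses = np.sort(np.asarray(losses, dtype=float))[::-1]  # descending
    k = max(1, int(np.ceil(alpha * len(losses))))            # tail size
    return float(losses[:k].mean())
```

At α = 1 this recovers the ordinary average loss; smaller α focuses training entirely on the hardest examples.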
DORO: Distributional and Outlier Robust Optimization
This work applies DRO to real, large-scale tasks with subpopulation shift, observes that DRO performs relatively poorly and suffers severe instability, and proposes the framework of DORO, for Distributional and Outlier Robust Optimization, which prevents DRO from overfitting to potential outliers.


Distributionally Robust Language Modeling
An approach which trains a model that performs well over a wide range of potential test distributions, called topic conditional value at risk (topic CVaR), obtains a 5.5 point perplexity reduction over MLE when the language models are trained on a mixture of Yelp reviews and news and tested only on reviews.
The Risk of Racial Bias in Hate Speech Detection
This work proposes *dialect* and *race priming* as ways to reduce the racial bias in annotation, showing that when annotators are made explicitly aware of an AAE tweet’s dialect they are significantly less likely to label the tweet as offensive.
Large Scale Crowdsourcing and Characterization of Twitter Abusive Behavior
This work proposes an incremental and iterative methodology, that utilizes the power of crowdsourcing to annotate a large scale collection of tweets with a set of abuse-related labels, and identifies a reduced but robust set of labels.
Automated Hate Speech Detection and the Problem of Offensive Language
This work used a crowd-sourced hate speech lexicon to collect tweets containing hate speech keywords and labeled a sample of these tweets into three categories: those containing hate speech, those with only offensive language, and those with neither.
Nash Convergence of Gradient Dynamics in General-Sum Games
This work analyzes the behavior of agents that incrementally adapt their strategy through gradient ascent on expected payoff, in the simple setting of two-player, two-action, iterated general-sum games, and shows that either the agents will converge to a Nash equilibrium, or, if the strategies themselves do not converge, their average payoffs will nevertheless converge to the payoffs of a Nash equilibrium.
Does Distributionally Robust Supervised Learning Give Robust Classifiers?
This paper proves that DRSL just ends up giving a classifier that exactly fits the given training distribution, which is too pessimistic, then proposes a simple DRSL that overcomes this pessimism and empirically demonstrates its effectiveness.
Certifying Some Distributional Robustness with Principled Adversarial Training
This work provides a training procedure that augments model parameter updates with worst-case perturbations of the training data, and efficiently certifies robustness for the population loss by considering a Lagrangian penalty formulation of perturbing the underlying data distribution in a Wasserstein ball.
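The Lagrangian-penalty inner problem described here – find a perturbation that increases the loss while paying a quadratic transport cost – can be illustrated on a logistic loss with a hand-written gradient. This is a toy sketch under my own naming and hyperparameters, not the paper's certified procedure:

```python
import numpy as np

def worst_case_perturbation(x, w, y, gamma=1.0, step=0.1, iters=100):
    """Gradient ascent on  loss(x + delta) - gamma * ||delta||^2  for the
    logistic loss log(1 + exp(-y * w @ (x + delta))), label y in {-1, +1}.
    The penalty gamma keeps the perturbation inside a soft Wasserstein-style
    budget instead of a hard norm constraint."""
    delta = np.zeros_like(x)
    for _ in range(iters):
        z = y * w @ (x + delta)
        grad_loss = -y * w / (1.0 + np.exp(z))   # d loss / d delta
        delta += step * (grad_loss - 2.0 * gamma * delta)
    return delta
```

The training procedure would then take an ordinary gradient step on the loss evaluated at the perturbed inputs.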
Demographic Dialectal Variation in Social Media: A Case Study of African-American English
This work presents a case study of dialectal language in online conversational text by investigating African-American English (AAE) on Twitter, proposes a distantly supervised model to identify AAE-like language from demographics associated with geo-located messages, and verifies that this language follows well-known AAE linguistic phenomena.
Kullback-Leibler divergence constrained distributionally robust optimization
The main contribution of the paper is to show that KL divergence constrained DRO problems are often of the same complexity as their original stochastic programming problems; thus, KL divergence appears to be a good candidate for modeling distribution ambiguities in mathematical programming.
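The tractability result rests on the well-known dual of the KL-constrained worst case, sup over {Q : KL(Q‖P) ≤ ε} of E_Q[L] equals inf over λ > 0 of λ log E_P[exp(L/λ)] + λε. A numerically stable sketch that evaluates this dual by grid search over λ (the grid and function name are assumptions for illustration):

```python
import numpy as np

def kl_dro_value(losses, eps, lambdas=np.logspace(-3, 3, 2000)):
    """Worst-case expected loss over the KL ball {Q : KL(Q||P) <= eps},
    computed via the dual  inf_{lam>0} lam*log E_P[exp(L/lam)] + lam*eps.
    The max is subtracted before exponentiating for numerical stability."""
    losses = np.asarray(losses, dtype=float)
    m = losses.max()
    vals = [m + lam * np.log(np.mean(np.exp((losses - m) / lam))) + lam * eps
            for lam in lambdas]
    return float(min(vals))
```

At ε = 0 the dual collapses to the ordinary expected loss; as ε grows, the value interpolates toward the maximum loss, matching the intuition that a larger KL ball admits a more adversarial distribution.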
Adam: A Method for Stochastic Optimization
This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
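The Adam update itself is compact enough to write out: exponential moving averages of the gradient and its elementwise square, bias-corrected for their zero initialization, then a step scaled by the square root of the second moment. A minimal single-parameter sketch (default hyperparameters follow the paper; the function name is mine):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update. m, v are the running first and second moment
    estimates; t is the 1-indexed step count used for bias correction."""
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    m_hat = m / (1 - b1 ** t)        # bias-corrected first moment
    v_hat = v / (1 - b2 ** t)        # bias-corrected second moment
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Usage: minimize f(theta) = theta^2 from theta = 5 (gradient is 2*theta).
theta, m, v = 5.0, 0.0, 0.0
for t in range(1, 2001):
    theta, m, v = adam_step(theta, 2.0 * theta, m, v, t, lr=0.01)
```

Because the update divides by the root of the second moment, the effective step size is roughly `lr` regardless of the gradient's scale, which is what makes Adam robust across problems without per-problem learning-rate tuning.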