• Publications
  • Influence
XLNet: Generalized Autoregressive Pretraining for Language Understanding
XLNet is proposed, a generalized autoregressive pretraining method that enables learning bidirectional contexts by maximizing the expected likelihood over all permutations of the factorization order and overcomes the limitations of BERT thanks to its autore progressive formulation.
Transformer-XL: Attentive Language Models beyond a Fixed-Length Context
This work proposes a novel neural architecture Transformer-XL that enables learning dependency beyond a fixed length without disrupting temporal coherence, which consists of a segment-level recurrence mechanism and a novel positional encoding scheme.
Unsupervised Data Augmentation for Consistency Training
A new perspective on how to effectively noise unlabeled examples is presented and it is argued that the quality of noising, specifically those produced by advanced data augmentation methods, plays a crucial role in semi-supervised learning.
Breaking the Softmax Bottleneck: A High-Rank RNN Language Model
It is shown that the expressiveness of Softmax-based models (including the majority of neural language models) is limited by a Softmax bottleneck, and a simple and effective method is proposed to address this issue.
Good Semi-supervised Learning That Requires a Bad GAN
Theoretically, it is shown that given the discriminator objective, good semisupervised learning indeed requires a bad generator, and a novel formulation based on the analysis that substantially improves over feature matching GANs is derived, obtaining state-of-the-art results on multiple benchmark datasets.
Unsupervised Data Augmentation
UDA has a small twist in that it makes use of harder and more realistic noise generated by state-of-the-art data augmentation methods, which leads to substantial improvements on six language tasks and three vision tasks even when the labeled set is extremely small.
Controllable Invariance through Adversarial Feature Learning
This paper shows that the proposed framework induces an invariant representation, and leads to better generalization evidenced by the improved performance on three benchmark tasks.
Meta Pseudo Labels
We present Meta Pseudo Labels, a semi-supervised learning method that achieves a new state-of-the-art top-1 accuracy of 90.2% on ImageNet, which is 1.6% better than the existing state-of-the-art
Pay Attention to MLPs
This work proposes a simple attention-free network architecture, gMLP, based solely on MLPs with gating, and shows that it can perform as well as Transformers in key language and vision applications and can scale as much as Transformers over increased data and compute.
SimVLM: Simple Visual Language Model Pretraining with Weak Supervision
(b)). These results suggest zero-shot cross-modality transfer emerges with the scaling of weakly labeled data.