QANet: Combining Local Convolution with Global Self-Attention for Reading Comprehension
- A. Yu, David Dohan, Quoc V. Le
- International Conference on Learning Representations
- 15 February 2018
A new Q&A architecture called QANet is proposed, which does not require recurrent networks: its encoder consists exclusively of convolution and self-attention, where convolution models local interactions and self-attention models global interactions.
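The convolution-plus-self-attention encoder idea translates directly into code. Below is a minimal, hypothetical PyTorch sketch of such a block (layer names, hyperparameters, and the residual/normalization layout are illustrative assumptions, not the paper's exact configuration): a depthwise-separable convolution handles local interactions, then multi-head self-attention handles global ones.

```python
# Minimal sketch of a convolution + self-attention encoder block in the spirit of QANet.
# Hyperparameters and layer arrangement are illustrative assumptions, not the paper's exact setup.
import torch
import torch.nn as nn


class ConvSelfAttentionBlock(nn.Module):
    def __init__(self, d_model=128, kernel_size=7, num_heads=8):
        super().__init__()
        # Depthwise-separable convolution models local interactions.
        self.depthwise = nn.Conv1d(d_model, d_model, kernel_size,
                                   padding=kernel_size // 2, groups=d_model)
        self.pointwise = nn.Conv1d(d_model, d_model, kernel_size=1)
        # Multi-head self-attention models global interactions.
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                      # x: (batch, seq_len, d_model)
        h = self.norm1(x).transpose(1, 2)      # (batch, d_model, seq_len) for Conv1d
        h = self.pointwise(self.depthwise(h)).transpose(1, 2)
        x = x + h                              # residual connection around the convolution
        a = self.norm2(x)
        h, _ = self.attn(a, a, a)
        return x + h                           # residual connection around self-attention


x = torch.randn(2, 50, 128)                    # (batch, seq_len, d_model)
print(ConvSelfAttentionBlock()(x).shape)       # torch.Size([2, 50, 128])
```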
Finetuned Language Models Are Zero-Shot Learners
- Jason Wei, Maarten Bosma, Quoc V. Le
- International Conference on Learning Representations
- 3 September 2021
It is shown that instruction tuning (finetuning language models on a collection of datasets described via instructions) substantially improves zero-shot performance on unseen tasks and outperforms few-shot GPT-3 by a large margin on ANLI, RTE, BoolQ, AI2-ARC, OpenbookQA, and StoryCloze.
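The core data-preparation step of instruction tuning, rewriting labeled examples as natural-language instructions with textual targets, is easy to illustrate. The template wording below is an illustrative assumption, not the paper's exact template.

```python
# Minimal sketch of converting a labeled NLI example into an instruction-formatted
# training pair, in the spirit of instruction tuning. The template text and label
# verbalizations are illustrative assumptions.
def to_instruction_example(premise, hypothesis, label):
    prompt = (
        "Premise: {p}\nHypothesis: {h}\n"
        "Does the premise entail the hypothesis? Answer yes, no, or maybe."
    ).format(p=premise, h=hypothesis)
    target = {"entailment": "yes", "contradiction": "no", "neutral": "maybe"}[label]
    return {"input": prompt, "target": target}


print(to_instruction_example(
    "A man is playing a guitar on stage.",
    "A person is performing music.",
    "entailment"))
```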
SimVLM: Simple Visual Language Model Pretraining with Weak Supervision
- Zirui Wang, Jiahui Yu, A. Yu, Zihang Dai, Yulia Tsvetkov, Yuan Cao
- International Conference on Learning Representations
- 24 August 2021
These results suggest that zero-shot cross-modality transfer emerges with the scaling of weakly labeled data.
Orthogonal Weight Normalization: Solution to Optimization over Multiple Dependent Stiefel Manifolds in Deep Neural Networks
- Lei Huang, Xianglong Liu, B. Lang, A. Yu, Bo Li
- AAAI Conference on Artificial Intelligence
- 16 September 2017
It is shown that orthogonal rectangular weight matrices can stabilize the distribution of network activations and regularize FNNs, and that the resulting orthogonal linear module can be used as an alternative to the standard linear module in optimization over multiple dependent Stiefel manifolds.
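One simple way to keep a rectangular weight matrix orthogonal is to re-parameterize it through a QR decomposition, as sketched below; this is an illustrative stand-in for the idea, not the eigendecomposition-based solution derived in the paper.

```python
# Minimal sketch of an "orthogonal linear" layer: the effective weight matrix always
# has orthonormal columns because it is produced by QR decomposition of a free
# parameter. Illustrative stand-in, not the paper's proposed method.
import torch
import torch.nn as nn


class OrthogonalLinear(nn.Module):
    def __init__(self, in_features, out_features):
        super().__init__()
        assert out_features <= in_features, "need at most in_features orthonormal directions"
        self.v = nn.Parameter(torch.randn(in_features, out_features) * 0.1)

    def forward(self, x):                 # x: (batch, in_features)
        q, _ = torch.linalg.qr(self.v)    # q: (in_features, out_features), orthonormal columns
        return x @ q


layer = OrthogonalLinear(64, 32)
q = torch.linalg.qr(layer.v)[0]
print(torch.allclose(q.T @ q, torch.eye(32), atol=1e-5))  # True: columns are orthonormal
print(layer(torch.randn(4, 64)).shape)                    # torch.Size([4, 32])
```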
Learning to Skim Text
- A. Yu, Hongrae Lee, Quoc V. Le
- Annual Meeting of the Association for Computational Linguistics
- 23 April 2017
The proposed model is a modified LSTM with jumping, a recurrent network that learns how far to jump after reading a few words of the input text; it is up to 6 times faster than the standard sequential LSTM while maintaining the same or even better accuracy.
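The read-then-jump control flow can be sketched at inference time as a short loop. The greedy jump policy below is an illustrative simplification; the paper trains the jump decisions with reinforcement learning (REINFORCE), which is not shown here.

```python
# Minimal inference-time sketch of the "read a few words, then jump" control flow.
# The greedy argmax policy and all sizes are illustrative simplifications.
import torch
import torch.nn as nn


class SkimReader(nn.Module):
    def __init__(self, vocab_size, d_model=64, read_len=3, max_jump=5):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.lstm = nn.LSTMCell(d_model, d_model)
        self.jump_head = nn.Linear(d_model, max_jump + 1)  # predicting 0 means "stop reading"
        self.read_len = read_len

    def forward(self, tokens):                      # tokens: (seq_len,) LongTensor
        h = c = torch.zeros(1, self.lstm.hidden_size)
        pos = 0
        while pos < tokens.size(0):
            # Read a small window of tokens sequentially.
            for t in tokens[pos:pos + self.read_len]:
                h, c = self.lstm(self.embed(t).unsqueeze(0), (h, c))
            pos += self.read_len
            # Decide how far to jump ahead (0 terminates reading early).
            jump = self.jump_head(h).argmax(dim=-1).item()
            if jump == 0:
                break
            pos += jump
        return h                                    # final state, e.g. fed to a classifier


reader = SkimReader(vocab_size=1000)
print(reader(torch.randint(0, 1000, (40,))).shape)  # torch.Size([1, 64])
```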
Scaling Instruction-Finetuned Language Models
- Hyung Won Chung, Le Hou, Jason Wei
- ArXiv
- 20 October 2022
This result shows that instruction finetuning and UL2 continued pre-training are complementary, compute-efficient methods to improve the performance of language models without increasing model scale.
Combined Scaling for Zero-shot Transfer Learning
- Hieu Pham, Zihang Dai, Quoc V. Le
- ArXiv
- 2021
We present a combined scaling method called BASIC that achieves 85.7% top-1 zero-shot accuracy on the ImageNet ILSVRC-2012 validation set, surpassing the best-published zero-shot models – CLIP and…
Neural Symbolic Reader: Scalable Integration of Distributed and Symbolic Representations for Reading Comprehension
- Xinyun Chen, Chen Liang, A. Yu, Denny Zhou, D. Song, Quoc V. Le
- International Conference on Learning Representations
- 30 April 2020
The Neural Symbolic Reader (NeRd) is presented, which includes a reader to encode the passage and question, and a programmer, e.g., an LSTM, to generate a program that is executed to produce the answer.
GLaM: Efficient Scaling of Language Models with Mixture-of-Experts
- Nan Du, Yanping Huang, Claire Cui
- International Conference on Machine Learning
- 13 December 2021
This paper proposes and develops a family of language models named GLaM, which uses a sparsely activated mixture-of-experts architecture to scale the model capacity while also incurring substantially less training cost compared to dense variants.
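A sparsely activated mixture-of-experts feed-forward layer with top-2 gating is the general mechanism such models build on; the sketch below is an illustrative simplification (no load-balancing loss, no expert capacity limits, arbitrary sizes) rather than GLaM's actual implementation.

```python
# Minimal sketch of a sparsely activated mixture-of-experts feed-forward layer with
# top-2 gating. Sizes and the absence of load balancing are illustrative simplifications.
import torch
import torch.nn as nn


class TopTwoMoE(nn.Module):
    def __init__(self, d_model=64, d_ff=256, num_experts=8):
        super().__init__()
        self.gate = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x):                                  # x: (num_tokens, d_model)
        scores = self.gate(x)                              # (num_tokens, num_experts)
        top_vals, top_idx = scores.topk(2, dim=-1)         # each token picks its 2 best experts
        weights = top_vals.softmax(dim=-1)                 # mixing weights over the 2 experts
        out = torch.zeros_like(x)
        for k in range(2):
            for e, expert in enumerate(self.experts):
                mask = top_idx[:, k] == e                  # tokens routed to expert e at rank k
                if mask.any():
                    out[mask] += weights[mask, k:k + 1] * expert(x[mask])
        return out


tokens = torch.randn(10, 64)
print(TopTwoMoE()(tokens).shape)                           # torch.Size([10, 64])
```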
AdaDelay: Delay Adaptive Distributed Stochastic Optimization
- S. Sra, A. Yu, Mu Li, Alex Smola
- International Conference on Artificial Intelligence and Statistics
- 2 May 2016
Stochastic convex optimization algorithms are developed under a delayed gradient model in which server nodes update parameters and worker nodes compute stochastic (sub)gradients, with noticeable improvements for large-scale real datasets with billions of examples and features.
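The setting can be simulated in a few lines: gradients arrive computed at stale iterates, and the step size shrinks with the observed delay. The schedule alpha_t = alpha_0 / sqrt(t + tau) used below is an illustrative delay-dependent choice in the spirit of the method, not necessarily the exact schedule analyzed in the paper.

```python
# Minimal simulation of delay-adaptive asynchronous SGD on a linear regression problem:
# each gradient is computed at a stale iterate with random delay tau, and the step size
# decreases with the observed delay. Illustrative sketch, not the paper's algorithm verbatim.
import numpy as np

rng = np.random.default_rng(0)
w = np.zeros(2)                       # parameters held by the server
history = [w.copy()]                  # past iterates, so workers can read stale ones
w_true = np.array([1.0, -2.0])


def grad(w, x, y):                    # stochastic gradient of the squared loss
    return 2 * (w @ x - y) * x


for t in range(1, 2001):
    tau = int(rng.integers(0, 11))                      # delay of this worker's gradient
    stale_w = history[max(0, len(history) - 1 - tau)]   # gradient computed at a stale iterate
    x = rng.normal(size=2)
    y = w_true @ x + 0.1 * rng.normal()
    alpha = 0.1 / np.sqrt(t + tau)                      # delay-adaptive step size
    w = w - alpha * grad(stale_w, x, y)
    history.append(w.copy())

print(np.round(w, 2))                                   # should approach [ 1. -2.]
```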