Language Models are Unsupervised Multitask Learners
It is demonstrated that language models begin to learn natural language processing tasks without any explicit supervision when trained on WebText, a new dataset of millions of webpages, suggesting a promising path toward building language processing systems that learn to perform tasks from their naturally occurring demonstrations.
Language Models are Few-Shot Learners
GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic.
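For illustration, a minimal sketch of the few-shot evaluation format the paper uses: task demonstrations are packed into the context and the model is asked to complete the final, unanswered example. The translation pairs below are illustrative, in the style of the paper's figures.

```python
# Hypothetical few-shot prompt construction; no fine-tuning is involved,
# the model simply continues the text after the last "=>".
demonstrations = [
    ("sea otter", "loutre de mer"),
    ("cheese", "fromage"),
]
query = "mint"

prompt = "Translate English to French:\n"
for en, fr in demonstrations:
    prompt += f"{en} => {fr}\n"
prompt += f"{query} =>"

print(prompt)  # this string would be fed to the language model as context
```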
Generating Long Sequences with Sparse Transformers
This paper introduces sparse factorizations of the attention matrix which reduce its quadratic cost to $O(n \sqrt{n})$, generates unconditional samples that demonstrate global coherence and great diversity, and shows it is possible in principle to use self-attention to model sequences of length one million or more.
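A minimal numeric sketch of one such factorization, the strided pattern, assuming a single combined mask for simplicity (the paper distributes the local and strided connectivity across separate heads); sizes are illustrative.

```python
import numpy as np

def strided_sparse_mask(n: int, stride: int) -> np.ndarray:
    """Boolean mask for a strided sparse attention pattern: position i
    attends to (a) the previous `stride` positions and (b) every
    stride-th earlier position. With stride ~ sqrt(n), each row has
    O(sqrt(n)) nonzeros, for O(n*sqrt(n)) total work."""
    mask = np.zeros((n, n), dtype=bool)
    for i in range(n):
        for j in range(i + 1):                 # causal: attend only backwards
            local = (i - j) < stride           # recent context window
            strided = (i - j) % stride == 0    # periodic "summary" positions
            mask[i, j] = local or strided
    return mask

print(strided_sparse_mask(8, 3).astype(int))
```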
Scaling Laws for Neural Language Models
Larger models are significantly more sample-efficient, such that optimally compute-efficient training involves training very large models on a relatively modest amount of data and stopping significantly before convergence.
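For reference, the paper's central fits take power-law form in parameter count $N$, dataset size $D$, and compute budget $C_{\min}$; the exponents below are the paper's approximate reported values, quoted here as indicative rather than exact.

```latex
% Approximate power-law fits from "Scaling Laws for Neural Language Models":
L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad \alpha_N \approx 0.076
L(D) \approx \left(\frac{D_c}{D}\right)^{\alpha_D}, \qquad \alpha_D \approx 0.095
L(C_{\min}) \approx \left(\frac{C_c^{\min}}{C_{\min}}\right)^{\alpha_C^{\min}}, \qquad \alpha_C^{\min} \approx 0.050
```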
Very Deep VAEs Generalize Autoregressive Models and Can Outperform Them on Images
- Rewon Child
- Computer Science, ICLR
- 20 November 2020
This work presents a hierarchical VAE that, for the first time, outperforms the PixelCNN in log-likelihood on all natural image benchmarks, visualizes the generative process, and shows that VAEs learn efficient hierarchical visual representations.
Convolutional Recurrent Neural Networks for Small-Footprint Keyword Spotting
Systems and methods for creating and using convolutional recurrent neural networks (CRNNs) for small-footprint keyword spotting (KWS) are described; a CRNN model embodiment demonstrates high accuracy and robust performance in a wide range of environments.
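A minimal PyTorch sketch of such a CRNN, with a convolutional front end over spectrogram features feeding a recurrent layer and a linear classifier; the layer shapes here are assumptions for illustration, not the paper's configuration.

```python
import torch
import torch.nn as nn

class CRNN(nn.Module):
    """Toy convolutional recurrent network for keyword spotting:
    conv over (mel, time) -> GRU over time -> per-keyword scores."""
    def __init__(self, n_mels=40, hidden=64, n_keywords=2):
        super().__init__()
        self.conv = nn.Conv2d(1, 32, kernel_size=(20, 5), stride=(8, 2))
        conv_freq = (n_mels - 20) // 8 + 1          # frequency bins after conv
        self.rnn = nn.GRU(input_size=32 * conv_freq, hidden_size=hidden,
                          batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, n_keywords)

    def forward(self, x):                            # x: (batch, 1, n_mels, time)
        h = torch.relu(self.conv(x))                 # (batch, 32, freq', time')
        b, c, f, t = h.shape
        h = h.permute(0, 3, 1, 2).reshape(b, t, c * f)   # time-major features
        out, _ = self.rnn(h)
        return self.fc(out[:, -1])                   # classify from last frame

model = CRNN()
scores = model(torch.randn(2, 1, 40, 151))           # two toy utterances
print(scores.shape)                                  # torch.Size([2, 2])
```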
Exploring neural transducers for end-to-end speech recognition
- Eric Battenberg, Jitong Chen, Zhenyao Zhu
- Computer Science, IEEE Automatic Speech Recognition and…
- 24 July 2017
It is shown that, without any language model, Seq2Seq and RNN-Transducer models both outperform the best reported CTC models with a language model on the popular Hub5'00 benchmark.
PaLM: Scaling Language Modeling with Pathways
A 540-billion parameter, densely activated Transformer language model, called PaLM, achieves breakthrough performance, outperforming the state of the art on a suite of multi-step reasoning tasks and outperforming average human performance on the recently released BIG-bench benchmark.
Active Learning for Speech Recognition: the Power of Gradients
- Jiaji Huang, Rewon Child, Vinay Rao, Hairong Liu, S. Satheesh, Adam Coates
- Computer Science, ArXiv
- 10 December 2016
This work investigates the Expected Gradient Length (EGL) approach to active learning for end-to-end speech recognition, justifies EGL from a variance-reduction perspective, and observes that EGL's measure of informativeness picks novel samples uncorrelated with confidence scores.
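A toy sketch of the EGL score, using a softmax classifier as a stand-in for the speech model (the paper works with end-to-end ASR models; everything below is illustrative): each unlabeled input is scored by the gradient norm of the loss, averaged over candidate labels weighted by the model's own posterior.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 5))                # toy model: 3 classes, 5 features

def egl_score(x):
    """Expected gradient length: sum_k p(k|x) * ||grad of CE loss for label k||."""
    logits = W @ x
    p = np.exp(logits - logits.max())
    p /= p.sum()
    score = 0.0
    for k in range(len(p)):                # hypothesize label k
        grad = np.outer(p - np.eye(len(p))[k], x)   # d(CE)/dW under label k
        score += p[k] * np.linalg.norm(grad)
    return score

pool = rng.normal(size=(10, 5))            # unlabeled pool
ranked = sorted(range(10), key=lambda i: -egl_score(pool[i]))
print("query order:", ranked)              # most informative samples first
```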
Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model
The infrastructure and the 3D parallelism methodology used to train the largest monolithic transformer-based language model, Megatron-Turing NLG 530B (MT-NLG), with 530 billion parameters, are presented.
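A toy sketch of the rank-grid idea behind 3D parallelism, in which the world size factorizes into data-, pipeline-, and tensor-parallel degrees; the degrees below are illustrative, not MT-NLG's actual configuration.

```python
import itertools

# Hypothetical decomposition: world size 16 = 2 (data) x 2 (pipeline) x 4 (tensor).
# Each GPU rank gets one coordinate per axis: its data-parallel replica,
# its pipeline stage, and its tensor-parallel slice of each layer.
data, pipeline, tensor = 2, 2, 4
grid = itertools.product(range(data), range(pipeline), range(tensor))
for rank, (d, p, t) in enumerate(grid):
    print(f"rank {rank:2d}: data group {d}, pipeline stage {p}, tensor slice {t}")
```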