• Publications
Proceedings of NIPS
Diffusion of Lexical Change in Social Media
Using a latent vector autoregressive model to aggregate across thousands of words, this work identifies high-level patterns in the diffusion of linguistic change across the United States, supporting prior arguments that focus on geographical proximity and population size.
Proceedings of EMNLP
Random Feature Attention
This work proposes and explores RFA, a linear time and space attention mechanism that uses random feature methods to approximate the softmax function, showing that RFA is competitive in both accuracy and efficiency on three long-text classification datasets.
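The core idea behind random-feature attention can be sketched in a few lines: replace the exponential in softmax attention with a randomized feature map whose inner products approximate it in expectation, so attention can be computed in time linear in sequence length. This is a minimal NumPy illustration of that idea, not the paper's exact parameterization (the feature map, scaling, and function names here are assumptions):

```python
import numpy as np

def softmax_attention(q, k, v):
    # Exact softmax attention: O(n^2) in sequence length.
    a = np.exp(q @ k.T)
    return (a / a.sum(-1, keepdims=True)) @ v

def rfa_attention(q, k, v, num_features=256, seed=0):
    # Linear-time approximation: phi(x) = exp(x W - ||x||^2 / 2) gives
    # E[phi(q) . phi(k)] ~= exp(q . k), the softmax kernel.
    d = q.shape[-1]
    rng = np.random.default_rng(seed)
    w = rng.standard_normal((d, num_features))

    def phi(x):
        return np.exp(x @ w - (x ** 2).sum(-1, keepdims=True) / 2) / np.sqrt(num_features)

    qf, kf = phi(q), phi(k)
    num = qf @ (kf.T @ v)     # O(n * num_features * d), never forms an n x n matrix
    den = qf @ kf.sum(0)      # per-query normalizer
    return num / den[:, None]
```

With enough random features, the approximation tracks exact attention closely on small inputs, while the cost grows linearly rather than quadratically with sequence length.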
The Right Tool for the Job: Matching Model and Instance Complexities
This work proposes a modification to contextual representation fine-tuning that allows an early (and fast) “exit” from neural network calculations for simple instances, and a late (and accurate) exit for hard instances, during inference.
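The early-exit mechanism described above can be illustrated with a toy sketch: attach a classifier head after each layer, and stop as soon as a head is confident enough. This is a schematic NumPy example, not the paper's BERT-based implementation; the layer shapes, `tanh` layers, and confidence threshold are all assumptions:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def early_exit_predict(x, layers, heads, threshold=0.9):
    # Run layers sequentially; after each layer, a classifier head scores
    # the hidden state. Exit as soon as the head's confidence passes the
    # threshold, skipping the remaining (more expensive) layers.
    h = x
    for i, (layer, head) in enumerate(zip(layers, heads)):
        h = np.tanh(layer @ h)               # toy stand-in for a transformer layer
        probs = softmax(head @ h)
        if probs.max() >= threshold or i == len(layers) - 1:
            return int(probs.argmax()), i    # prediction and exit layer
```

Easy inputs exit at shallow layers (fast), hard inputs fall through to the final layer (accurate), trading compute for confidence per instance.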
Bayesian Optimization of Text Representations
This work applies a sequential model-based optimization technique and shows that it makes standard linear models competitive with more sophisticated, expensive state-of-the-art methods based on latent variable models or neural networks across a range of topic classification and sentiment analysis problems.
Evaluating Models’ Local Decision Boundaries via Contrast Sets
This work proposes a more rigorous annotation paradigm for NLP that helps to close systematic gaps in test data, recommending that dataset authors manually perturb test instances in small but meaningful ways that (typically) change the gold label, creating contrast sets.
Deep Encoder, Shallow Decoder: Reevaluating Non-autoregressive Machine Translation
This work finds that the speed disadvantage of autoregressive baselines relative to non-autoregressive methods has been overestimated in three respects: suboptimal layer allocation, insufficient speed measurement, and lack of knowledge distillation.
Shortformer: Better Language Modeling using Shorter Inputs
This work shows that initially training the model on short subsequences, before moving on to longer ones, both reduces overall training time and substantially improves perplexity on WikiText-103, without adding any parameters.