Publications
Character-Level Language Modeling with Deeper Self-Attention
TLDR
In this paper, we show that a deep (64-layer) transformer model with fixed context outperforms RNN variants by a large margin, achieving state of the art on two popular benchmarks.
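The TLDR only names the architecture at a high level, so here is a minimal sketch, assuming a PyTorch decoder-style transformer over characters with a fixed-length context and a causal mask. The 64-layer depth matches the TLDR; the model width, head count, context length, and character vocabulary size are illustrative assumptions, and the paper's training details are not reproduced here.

    import torch
    import torch.nn as nn

    class CharTransformerLM(nn.Module):
        """Deep fixed-context character-level transformer LM (illustrative sizes)."""
        def __init__(self, vocab_size=256, context_len=512,
                     d_model=512, n_heads=8, n_layers=64):
            super().__init__()
            self.context_len = context_len
            self.tok_emb = nn.Embedding(vocab_size, d_model)
            self.pos_emb = nn.Embedding(context_len, d_model)
            layer = nn.TransformerEncoderLayer(
                d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True)
            self.blocks = nn.TransformerEncoder(layer, num_layers=n_layers)
            self.lm_head = nn.Linear(d_model, vocab_size)

        def forward(self, x):
            # x: (batch, seq) integer character ids, with seq <= context_len
            seq = x.size(1)
            pos = torch.arange(seq, device=x.device)
            h = self.tok_emb(x) + self.pos_emb(pos)
            # Causal mask: each position attends only to earlier characters.
            mask = nn.Transformer.generate_square_subsequent_mask(seq).to(x.device)
            h = self.blocks(h, mask=mask)
            return self.lm_head(h)  # (batch, seq, vocab) next-character logits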
Bridging the Gap for Tokenizer-Free Language Models
TLDR
We train a vanilla transformer network with 40 self-attention layers on the One Billion Word (lm1b) benchmark and achieve a new state of the art for tokenizer-free LMs, pushing these models to be on par with their word-based counterparts.
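A short sketch of what "tokenizer-free" means in practice, assuming a byte-level input pipeline in PyTorch: raw UTF-8 bytes serve directly as token ids over a fixed vocabulary of 256, with no word or subword tokenizer, and sequences of these ids would then feed a deep transformer such as the 40-layer model the TLDR describes. The helper names below are hypothetical.

    import torch

    def text_to_byte_ids(text):
        # UTF-8 bytes act as the token ids; no vocabulary is learned.
        return torch.tensor(list(text.encode("utf-8")), dtype=torch.long)

    def byte_ids_to_text(ids):
        # Invert the mapping; errors="replace" guards against partial UTF-8 sequences.
        return bytes(ids.tolist()).decode("utf-8", errors="replace")

    ids = text_to_byte_ids("Tokenizer-free language modeling works on raw bytes.")
    print(ids.shape)              # one id per byte of the input string
    print(byte_ids_to_text(ids))  # round-trips back to the original text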