Character-Level Language Modeling with Deeper Self-Attention

@inproceedings{AlRfou2019CharacterLevelLM,
  title={Character-Level Language Modeling with Deeper Self-Attention},
  author={Rami Al-Rfou and Dokook Choe and Noah Constant and Mandy Guo and Llion Jones},
  booktitle={AAAI},
  year={2019}
}
  • Rami Al-Rfou, Dokook Choe, Noah Constant, Mandy Guo, Llion Jones
  • Published in AAAI 2019
  • Computer Science, Mathematics
  • LSTMs and other RNN variants have shown strong performance on character-level language modeling. [...] Key result: to get good results at this depth, we show that it is important to add auxiliary losses, both at intermediate network layers and intermediate sequence positions. (A minimal sketch of these auxiliary losses is given below.)
  • 128 citations
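The key result quoted above can be made concrete with a small amount of code. The following is a minimal sketch, not the authors' implementation: it builds a causal character-level transformer in PyTorch, gives every layer its own prediction head, and sums next-character cross-entropy over all sequence positions (the intermediate-position losses) and over all layers (the intermediate-layer losses). The layer count, model width, per-layer heads, and uniform loss weighting are illustrative assumptions; the paper trains a much deeper (64-layer) network and phases out some of the auxiliary terms over the course of training.

import torch
import torch.nn as nn
import torch.nn.functional as F


class DeepCharTransformerLM(nn.Module):
    """Causal character-level transformer with one softmax head per layer (sketch)."""

    def __init__(self, vocab_size=256, d_model=256, n_heads=8, n_layers=12, max_len=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Parameter(torch.zeros(1, max_len, d_model))  # learned positions
        self.layers = nn.ModuleList([
            nn.TransformerEncoderLayer(d_model, n_heads,
                                       dim_feedforward=4 * d_model,
                                       batch_first=True)
            for _ in range(n_layers)
        ])
        # One prediction head per layer, so intermediate layers can carry their own
        # auxiliary next-character losses (an assumption of this sketch; the paper's
        # classifier setup differs in detail).
        self.heads = nn.ModuleList([nn.Linear(d_model, vocab_size)
                                    for _ in range(n_layers)])

    def forward(self, x):
        # x: (batch, time) integer character ids.
        t = x.size(1)
        causal = torch.triu(torch.full((t, t), float("-inf"), device=x.device),
                            diagonal=1)
        h = self.embed(x) + self.pos[:, :t]
        per_layer_logits = []
        for layer, head in zip(self.layers, self.heads):
            h = layer(h, src_mask=causal)        # masked (causal) self-attention
            per_layer_logits.append(head(h))     # logits at every sequence position
        return per_layer_logits


def training_loss(model, chars):
    # Next-character cross-entropy at every sequence position (the intermediate-position
    # losses), summed over all layers (the intermediate-layer losses). Uniform weighting
    # is a simplification made here for brevity.
    inputs, targets = chars[:, :-1], chars[:, 1:]
    total = 0.0
    for logits in model(inputs):
        total = total + F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                                        targets.reshape(-1))
    return total


if __name__ == "__main__":
    model = DeepCharTransformerLM()
    batch = torch.randint(0, 256, (4, 129))  # toy byte-level "characters"
    loss = training_loss(model, batch)
    loss.backward()
    print(float(loss))

Running the script computes and backpropagates the combined loss on a random byte batch; the sketch is only meant to show where the intermediate-layer and intermediate-position terms enter the objective.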

    Citations

    Bridging the Gap for Tokenizer-Free Language Models
    • 2 citations
    Language Modeling with Transformer
    • Jing Zhang, Jian Ping Li, H. Li
    • Computer Science
    • 2019 16th International Computer Conference on Wavelet Active Media Technology and Information Processing
    • 2019
    Adaptive Input Representations for Neural Language Modeling
    • 117 citations
    An Empirical Study of Efficient ASR Rescoring with Transformers
    • 2 citations
    SegaBERT: Pre-training of Segment-aware BERT for Language Understanding
    • 3 citations
    Language Modeling with Deep Transformers
    • 50 citations
    Transformer-XL: Language Modeling
    • 2018
    • Highly Influenced
    Character-Level Translation with Self-attention
    • 4 citations
    How Much Self-Attention Do We Need? Trading Attention for Feed-Forward Layers
    • 6 citations
