Corpus ID: 207869719

ZEN: Pre-training Chinese Text Encoder Enhanced by N-gram Representations

@inproceedings{Diao2020ZENPC,
  title={ZEN: Pre-training Chinese Text Encoder Enhanced by N-gram Representations},
  author={Shizhe Diao and Jiaxin Bai and Y. Song and Tong Zhang and Yonggang Wang},
  booktitle={EMNLP},
  year={2020}
}
  • Shizhe Diao, Jiaxin Bai, Y. Song, Tong Zhang, Yonggang Wang
  • Published in EMNLP 2020
  • Computer Science
  • The pre-training of text encoders normally processes text as a sequence of tokens corresponding to small text units, such as word pieces in English and characters in Chinese. It omits information carried by larger text granularity, and thus the encoders cannot easily adapt to certain combinations of characters. This leads to a loss of important semantic information, which is especially problematic for Chinese because the language does not have explicit word boundaries. In this paper, we propose…
    11 Citations

    Joint Chinese Word Segmentation and Part-of-speech Tagging via Two-way Attentions of Auto-analyzed Knowledge
    • 7
    Improving Chinese Word Segmentation with Wordhood Memory Networks
    • 7
    • Highly Influenced
    Improving Constituency Parsing with Span Attention
    • 5
    ERNIE-Gram: Pre-Training with Explicitly N-Gram Masked Language Modeling for Natural Language Understanding
    Named Entity Recognition for Social Media Texts with Semantic Augmentation
    • 1
    Keyphrase Generation with Cross-Document Attention
    • 1
    Improving Named Entity Recognition with Attentive Ensemble of Syntactic Information
    • 1
    • Highly Influenced
