Corpus ID: 216867827

SegaBERT: Pre-training of Segment-aware BERT for Language Understanding

Authors: He Bai, Peng Shi, Jimmy J. Lin, Luchen Tan, Kun Xiong, Wen Gao, and Ming Li
Pre-trained language models have achieved state-of-the-art results in various natural language processing tasks. Most of them are based on the Transformer architecture, which distinguishes tokens with the token position index of the input sequence. However, sentence index and paragraph index are also important to indicate the token position in a document. We hypothesize that better contextual representations can be generated from the text encoder with richer positional information. To verify… 
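The hypothesis can be illustrated with a minimal sketch (table sizes and the embedding dimension are illustrative assumptions, not the paper's hyperparameters): in addition to a token-position embedding, each token also receives paragraph-index and sentence-index embeddings, and the three are summed.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # embedding dimension (illustrative)

# Separate embedding tables for each positional granularity
# (row counts are illustrative stand-ins for learnable tables).
para_emb = rng.normal(size=(8, d))    # paragraph index in the document
sent_emb = rng.normal(size=(32, d))   # sentence index within the paragraph
tok_emb = rng.normal(size=(128, d))   # token index within the sentence

def segment_aware_position(para_idx, sent_idx, tok_idx):
    """Sum the three positional embeddings, analogous to how vanilla BERT
    adds a single absolute token-position embedding."""
    return para_emb[para_idx] + sent_emb[sent_idx] + tok_emb[tok_idx]

# Two tokens at the same token index but in different sentences
# now receive different positional signals.
a = segment_aware_position(0, 0, 5)
b = segment_aware_position(0, 1, 5)
print(np.allclose(a, b))  # False
```

The point of the sketch is that positional information becomes document-structure-aware without changing the rest of the Transformer.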


ERNIE 3.0: Large-scale Knowledge Enhanced Pre-training for Language Understanding and Generation

A unified framework named ERNIE 3.0 is proposed for pre-training large-scale knowledge-enhanced models that fuses an auto-regressive network and an auto-encoding network, so that the trained model can be easily tailored for both natural language understanding and generation tasks with zero-shot learning, few-shot learning, or fine-tuning.

ERNIE-Doc: A Retrospective Long-Document Modeling Transformer

Two well-designed techniques, namely the retrospective feed mechanism and the enhanced recurrence mechanism, enable ERNIE-Doc, which has a much longer effective context length, to capture the contextual information of a complete document.

ERNIE 3.0 Titan: Exploring Larger-scale Knowledge Enhanced Pre-training for Language Understanding and Generation

ERNIE 3.0 Titan, a model with up to 260 billion parameters, is trained; it is the largest Chinese dense pre-trained model so far and outperforms state-of-the-art models on 68 NLP datasets.

MOOCRep: A Unified Pre-trained Embedding of MOOC Entities

MOOCRep is developed, a novel method based on Transformer language model trained with two pre-training objectives: graph-based objective to capture the powerful signal of entities and relations that exist in the graph, and domain-oriented objective to effectively incorporate the complexity level of concepts.

A Primer in BERTology: What We Know About How BERT Works

This paper is the first survey of over 150 studies of the popular BERT model, reviewing the current state of knowledge about how BERT works, what kind of information it learns and how it is represented, common modifications to its training objectives and architecture, the overparameterization issue, and approaches to compression.

What Makes a Star Teacher? A Hierarchical BERT Model for Evaluating Teacher's Performance in Online Education

A hierarchical course BERT model is proposed that can capture the hierarchical structure within each course as well as the deep semantic features extracted from the course content and achieves significant gain over several state-of-the-art methods.

Risk-aware Regularization for Opinion-based Portfolio Selection

Every investor faces the risk-return tradeoff when making investment decisions. Most investors construct a portfolio instead of putting all of their wealth into a single stock. However, most of…



StructBERT: Incorporating Language Structures into Pre-training for Deep Language Understanding

Inspired by the linearization exploration work of Elman, BERT is extended to a new model, StructBERT, by incorporating language structures into pre-training, and the new model is adapted to different levels of language understanding required by downstream tasks.

Text Summarization with Pretrained Encoders

This paper introduces a novel document-level encoder based on BERT which is able to express the semantics of a document and obtain representations for its sentences and proposes a new fine-tuning schedule which adopts different optimizers for the encoder and the decoder as a means of alleviating the mismatch between the two.
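The optimizer mismatch mentioned above is handled by giving the pre-trained encoder and the randomly initialized decoder separate learning-rate schedules. A hedged sketch (the warmup-style schedule is standard; the specific constants below are illustrative assumptions, not the paper's exact settings):

```python
def noam_lr(step, base_lr, warmup):
    """Warmup-then-decay schedule: rises for `warmup` steps,
    then decays proportionally to step ** -0.5."""
    return base_lr * min(step ** -0.5, step * warmup ** -1.5)

# Hypothetical settings: the pre-trained encoder gets a smaller peak rate
# and a longer warmup than the freshly initialized decoder, so the decoder
# can learn quickly without destabilizing the encoder's weights.
steps = (1, 10000, 20000)
enc_lr = [noam_lr(s, base_lr=2e-3, warmup=20000) for s in steps]
dec_lr = [noam_lr(s, base_lr=0.1, warmup=10000) for s in steps]
```

In practice the two parameter groups would simply be handed to two optimizer instances, each driven by its own schedule.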

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
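The masked-LM pre-training objective behind this can be sketched as follows (the 15% selection rate and 80/10/10 corruption split follow the paper; the tiny vocabulary is an illustrative stand-in):

```python
import random

random.seed(0)
MASK = "[MASK]"
VOCAB = ["a", "b", "c", "d", "e"]  # toy vocabulary

def mask_tokens(tokens, p=0.15):
    """Masked-LM corruption: ~15% of tokens are selected; of those,
    80% become [MASK], 10% a random token, 10% stay unchanged.
    `targets` holds the original token only at corrupted positions."""
    corrupted, targets = [], []
    for tok in tokens:
        if random.random() < p:
            targets.append(tok)
            r = random.random()
            if r < 0.8:
                corrupted.append(MASK)
            elif r < 0.9:
                corrupted.append(random.choice(VOCAB))
            else:
                corrupted.append(tok)
        else:
            corrupted.append(tok)
            targets.append(None)
    return corrupted, targets

corrupted, targets = mask_tokens(list("abcde" * 10))
```

Because the model must predict the selected tokens from both left and right context, the resulting representations are deeply bidirectional.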

Character-Level Language Modeling with Deeper Self-Attention

This paper shows that a deep (64-layer) transformer model with fixed context outperforms RNN variants by a large margin, achieving state of the art on two popular benchmarks: 1.13 bits per character on text8 and 1.06 on enwik8.
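Bits per character is just the model's average cross-entropy expressed in bits; as a quick reference (the loss value below is made up for illustration, not a reported number):

```python
import math

loss_nats = 0.78  # hypothetical average per-character cross-entropy in nats
bpc = loss_nats / math.log(2)  # convert nats -> bits per character
print(round(bpc, 3))  # 1.125
```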

Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks

Sentence-BERT (SBERT) is presented, a modification of the pretrained BERT network that uses siamese and triplet network structures to derive semantically meaningful sentence embeddings that can be compared using cosine similarity.
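The comparison step is plain cosine similarity over the pooled sentence vectors; a minimal sketch with made-up embeddings (in practice the vectors would come from the SBERT encoder):

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine similarity between two sentence embeddings."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

u = np.array([0.20, 0.90, 0.10])   # stand-in embedding for sentence A
v = np.array([0.25, 0.80, 0.05])   # stand-in embedding for a paraphrase of A
w = np.array([-0.70, 0.10, 0.60])  # stand-in embedding for an unrelated sentence

print(cosine_similarity(u, v) > cosine_similarity(u, w))  # True
```

Because similarity reduces to a dot product, large-scale semantic search becomes feasible without running a cross-encoder over every sentence pair.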

XLNet: Generalized Autoregressive Pretraining for Language Understanding

XLNet is proposed, a generalized autoregressive pretraining method that enables learning bidirectional contexts by maximizing the expected likelihood over all permutations of the factorization order and overcomes the limitations of BERT thanks to its autoregressive formulation.
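One way to picture the permutation objective: sample a factorization order and let each token attend only to tokens that precede it in that order. A sketch of the induced attention mask (the real model realizes this with two-stream attention, omitted here):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 5
order = rng.permutation(T)  # one sampled factorization order z

# mask[t, s] is True iff token t may attend to token s under this order.
mask = np.zeros((T, T), dtype=bool)
for i, t in enumerate(order):
    mask[t, order[:i]] = True  # t sees everything earlier in the order

print(mask.sum())  # T*(T-1)/2 = 10 allowed pairs, whatever order was sampled
```

Averaging the autoregressive loss over many sampled orders is what lets every token eventually condition on context from both sides.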

Enhancing Pre-Trained Language Representations with Rich Knowledge for Machine Reading Comprehension

This work introduces KT-NET, which employs an attention mechanism to adaptively select desired knowledge from KBs, and then fuses selected knowledge with BERT to enable context- and knowledge-aware predictions.

Improving Language Understanding by Generative Pre-Training

The general task-agnostic model outperforms discriminatively trained models that use architectures specifically crafted for each task, improving upon the state of the art in 9 out of the 12 tasks studied.

ALBERT: A Lite BERT for Self-supervised Learning of Language Representations

This work presents two parameter-reduction techniques to lower memory consumption and increase the training speed of BERT, and uses a self-supervised loss that focuses on modeling inter-sentence coherence.
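The first parameter-reduction technique, factorized embedding parameterization, is easy to quantify (the sizes below are in the spirit of the paper's large configurations; the other technique is cross-layer parameter sharing):

```python
V, H, E = 30000, 4096, 128  # vocab size, hidden size, small embedding size

bert_style = V * H            # one big V x H embedding table
albert_style = V * E + E * H  # V x E table followed by an E x H projection

print(bert_style, albert_style)  # 122880000 4364288
```

Decoupling the embedding size E from the hidden size H shrinks the embedding parameters by roughly 28x in this configuration, which is where much of ALBERT's memory saving comes from.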

BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension

BART is presented, a denoising autoencoder for pretraining sequence-to-sequence models, which matches the performance of RoBERTa on GLUE and SQuAD, and achieves new state-of-the-art results on a range of abstractive dialogue, question answering, and summarization tasks.