CLUE: A Chinese Language Understanding Evaluation Benchmark

  title={CLUE: A Chinese Language Understanding Evaluation Benchmark},
  author={Liang Xu and Xuanwei Zhang and Lu Li and Hai Hu and Chenjie Cao and Weitang Liu and Junyi Li and Yudong Li and Kai Sun and Yechen Xu and Yiming Cui and Cong Yu and Qianqian Dong and Yin Tian and Dian Yu and Bo Shi and Jun-jie Zeng and Rongzhao Wang and Weijian Xie and Yanting Li and Yina Patterson and Zuoyu Tian and Yiwen Zhang and He Zhou and Shaoweihua Liu and Quanbei Zhao and Cong Yue and Xinrui Zhang and Zhen-Yi Yang and Kyle Richardson and Zhenzhong Lan},
The advent of natural language understanding (NLU) benchmarks for English, such as GLUE and SuperGLUE, allows new NLU models to be evaluated across a diverse set of tasks. These comprehensive benchmarks have facilitated a broad range of research and applications in natural language processing (NLP). The problem, however, is that most such benchmarks are limited to English, which has made it difficult to replicate many of the successes in English NLU for other languages. To help remedy this issue…


IndoNLU: Benchmark and Resources for Evaluating Indonesian Natural Language Understanding

The first large-scale resource for training, evaluating, and benchmarking Indonesian natural language understanding (IndoNLU) is introduced, releasing baseline models for all twelve tasks as well as a framework for benchmark evaluation, enabling anyone to benchmark their system's performance.

OCNLI: Original Chinese Natural Language Inference

This paper presents the first large-scale NLI dataset for Chinese, the Original Chinese Natural Language Inference dataset (OCNLI), which closely follows the annotation protocol used for MNLI but introduces new strategies for eliciting diverse hypotheses.

Does Chinese BERT Encode Word Structure?

This work investigates Chinese BERT using both attention weight distribution statistics and probing tasks, finding that word information is captured by BERT; word-level features are mostly in the middle representation layers; and downstream tasks make different use of word features in BERT.

AMBERT: A Pre-trained Language Model with Multi-Grained Tokenization

This paper proposes a novel pre-trained language model, referred to as AMBERT (A Multi-grained BERT), built on both fine-grained and coarse-grained tokenizations, which outperforms the existing best-performing models in almost all cases.

Improving Machine Reading Comprehension with Contextualized Commonsense Knowledge

This paper aims to extract a new kind of structured knowledge from scripts and use it to improve MRC, and designs a teacher-student paradigm with multiple teachers to facilitate the transfer of knowledge in weakly-labeled MRC data.

Cross-lingual Inference with A Chinese Entailment Graph

This paper presents the first pipeline for building Chinese entailment graphs, which involves a novel high-recall open relation extraction (ORE) method and the first Chinese fine-grained entity typing dataset under the FIGER type ontology.

WeLM: A Well-Read Pre-trained Language Model for Chinese

A well-read pre-trained language model for Chinese that can seamlessly perform different types of tasks with zero or few-shot demonstrations, and that has basic skills at explaining and calibrating its own decisions, which are promising directions for future research.

CLOWER: A Pre-trained Language Model with Contrastive Learning over Word and Character Representations

This work proposes a simple yet effective PLM, CLOWER, which adopts Contrastive Learning Over Word and charactER representations, implicitly encoding coarse-grained information into multi-grained representations through contrastive learning over multi-grained inputs.
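Contrastive learning over paired representations of this kind is typically an InfoNCE-style objective: each word-level vector should be closest to its own character-level counterpart among all candidates in a batch. A minimal pure-Python sketch with toy two-dimensional vectors (an illustration of the general technique, not the authors' implementation):

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def info_nce(word_vecs, char_vecs, temperature=0.1):
    # InfoNCE loss: the i-th word-level vector is a positive pair with
    # the i-th character-level vector; all others in the batch are negatives.
    loss = 0.0
    for i, w in enumerate(word_vecs):
        logits = [cosine(w, c) / temperature for c in char_vecs]
        log_denom = math.log(sum(math.exp(l) for l in logits))
        loss += -(logits[i] - log_denom)
    return loss / len(word_vecs)

# Toy batch: two aligned (word-level, character-level) representation pairs.
words = [[1.0, 0.0], [0.0, 1.0]]
chars = [[0.9, 0.1], [0.1, 0.9]]
print(info_nce(words, chars))  # small loss: positives already dominate
```

Minimizing this loss pulls each word representation toward its own character-level view while pushing it away from the other pairs in the batch.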

JGLUE: Japanese General Language Understanding Evaluation

A Japanese NLU benchmark is built from scratch without translation to measure the general NLU ability in Japanese, and it is hoped that JGLUE will facilitate NLU research in Japanese.

Pretraining without Wordpieces: Learning Over a Vocabulary of Millions of Words

This work explores the possibility of developing a BERT-style pretrained model over a vocabulary of words instead of wordpieces; such a word-level BERT model, called WordBERT, makes significant improvements on cloze tests and machine reading comprehension.

ChID: A Large-scale Chinese IDiom Dataset for Cloze Test

A large-scale Chinese cloze test dataset, ChID, is proposed, which studies the comprehension of idioms, a unique language phenomenon in Chinese: the idioms in a passage are replaced by blank symbols, and the correct answer must be chosen from well-designed candidate idioms.
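The cloze format described above can be sketched as a simple data structure: a passage with a blanked-out idiom, a candidate list, and a gold index. The example text and field names below are hypothetical, chosen only to illustrate the task shape:

```python
# Hypothetical example in the spirit of ChID: a passage whose idiom has
# been replaced by a blank, plus candidate idioms to choose from.
sample = {
    "passage": "他做事一向脚踏实地，从不 ____ 。",   # "never aims unrealistically high"
    "candidates": ["好高骛远", "脚踏实地", "一心一意"],
    "answer": 0,  # index of the correct idiom among the candidates
}

def check(sample, predicted_index):
    # A prediction is correct if it selects the gold candidate index.
    return predicted_index == sample["answer"]

print(check(sample, 0))  # a correct prediction
```

A model for this task scores each candidate in the blank's context and predicts the index of the highest-scoring idiom.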

CLUECorpus2020: A Large-scale Chinese Corpus for Pre-training Language Model

CLUECorpus2020, a large-scale Chinese corpus from the CLUE organization, is introduced; it can be used directly for self-supervised learning, such as language model pre-training, or for language generation.

A Span-Extraction Dataset for Chinese Machine Reading Comprehension

This paper introduces a span-extraction dataset for Chinese machine reading comprehension to add linguistic diversity in this area, and hosted the Second Evaluation Workshop on Chinese Machine Reading Comprehension (CMRC 2018).

Probing Prior Knowledge Needed in Challenging Chinese Machine Reading Comprehension

Experimental results demonstrate that linguistic and general world knowledge may help improve the performance of the baseline reader in both general and domain-specific tasks.

SentEval: An Evaluation Toolkit for Universal Sentence Representations

We introduce SentEval, a toolkit for evaluating the quality of universal sentence representations. SentEval encompasses a variety of tasks, including binary and multi-class classification, natural language inference, and sentence similarity.

SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing

SentencePiece, a language-independent subword tokenizer and detokenizer designed for neural text processing, finds that it is possible to achieve accuracy comparable to direct subword training from raw sentences.
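The key idea — segmenting raw text into subword pieces from a learned vocabulary without relying on whitespace — can be illustrated with a toy greedy longest-match segmenter. Note this is only a sketch of vocabulary-driven subword segmentation; SentencePiece itself trains BPE or unigram language models rather than using a hand-built vocabulary:

```python
def subword_tokenize(text, vocab, unk="<unk>"):
    # Greedy longest-match segmentation over a fixed subword vocabulary.
    # Works identically on whitespace-free scripts (e.g. Chinese) and
    # alphabetic text, since it never assumes word boundaries.
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):  # try the longest piece first
            piece = text[i:j]
            if piece in vocab:
                tokens.append(piece)
                i = j
                break
        else:
            tokens.append(unk)  # no piece matched even a single character
            i += 1
    return tokens

# Toy vocabulary mixing English subwords and Chinese characters.
vocab = {"un", "believ", "able", "不", "可", "思", "议"}
print(subword_tokenize("unbelievable", vocab))  # ['un', 'believ', 'able']
print(subword_tokenize("不可思议", vocab))       # ['不', '可', '思', '议']
```

Because segmentation depends only on the vocabulary, the same code handles both examples, which is the language-independence property the paper emphasizes.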

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

This systematic study compares pre-training objectives, architectures, unlabeled datasets, transfer approaches, and other factors on dozens of language understanding tasks and achieves state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more.

Pre-Training With Whole Word Masking for Chinese BERT

The whole word masking (wwm) strategy for Chinese BERT is introduced, along with a series of Chinese pre-trained language models, and a simple but effective model called MacBERT is proposed, which improves upon RoBERTa in several ways.
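The whole word masking idea is that when a word is selected for masking, all of its characters (or subword tokens) are masked together rather than independently. A minimal sketch, assuming pre-segmented input where each word is a list of character tokens (an illustration of the strategy, not the paper's training pipeline):

```python
import random

def whole_word_mask(words, mask_prob=0.15, mask_token="[MASK]", seed=0):
    # Whole-word masking: if a word is selected, mask ALL of its
    # character tokens; otherwise keep the word fully intact.
    # `words` is a list of words, each given as a list of character tokens.
    rng = random.Random(seed)
    out = []
    for word in words:
        if rng.random() < mask_prob:
            out.extend([mask_token] * len(word))  # mask the whole word
        else:
            out.extend(word)
    return out

# "模型" ("model") spans two characters, so it is masked as a unit or not at all.
segmented = [["语"], ["言"], ["模", "型"]]
print(whole_word_mask(segmented, mask_prob=0.5))
```

The invariant to check is that a multi-character word never ends up partially masked, which is exactly what character-level random masking would violate.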

LCQMC: A Large-scale Chinese Question Matching Corpus

A search engine is used to collect large-scale question pairs related to high-frequency words from various domains; irrelevant pairs are then filtered by the Wasserstein distance, and finally three annotators manually check the remaining pairs, demonstrating the good quality of LCQMC.
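For one-dimensional empirical distributions with equally many samples, the Wasserstein (earth mover's) distance used in such filtering has a simple closed form: the mean absolute difference of the sorted values. A sketch with hypothetical per-word relevance scores (a simplification for illustration, not the paper's exact filtering setup):

```python
def wasserstein_1d(xs, ys):
    # 1-D earth mover's distance between two empirical distributions
    # with the same number of samples: pair up sorted values and
    # average the absolute differences.
    assert len(xs) == len(ys)
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)

# Hypothetical relevance-score distributions for two candidate question pairs:
relevant   = wasserstein_1d([0.9, 0.8, 0.7], [0.85, 0.8, 0.75])
irrelevant = wasserstein_1d([0.9, 0.8, 0.7], [0.1, 0.2, 0.15])
print(relevant, irrelevant)  # the irrelevant pair is much farther apart
```

A pair whose distance exceeds a threshold would be discarded as irrelevant before manual annotation.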

GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding

A benchmark of nine diverse NLU tasks, an auxiliary dataset for probing models for understanding of specific linguistic phenomena, and an online platform for evaluating and comparing models, which favors models that can represent linguistic knowledge in a way that facilitates sample-efficient learning and effective knowledge-transfer across tasks.