SMTCE: A Social Media Text Classification Evaluation Benchmark and BERTology Models for Vietnamese

Luan Nguyen, Kiet Van Nguyen, Ngan Luu-Thuy Nguyen
Text classification is a typical natural language processing or computational linguistics task with various interesting applications. As the number of users on social media platforms increases, data acceleration promotes emerging studies on Social Media Text Classification (SMTC) or social media text mining on these valuable resources. In contrast to English, Vietnamese, one of the low-resource languages, is still not concentrated on and exploited thoroughly. Inspired by the success of…

Investigating Monolingual and Multilingual BERT Models for Vietnamese Aspect Category Detection

This research study is the first attempt to apply various available pre-trained language models to the aspect category detection task and to utilize datasets from other languages through multilingual models.

Exploiting Vietnamese Social Media Characteristics for Textual Emotion Recognition in Vietnamese

The experimental evaluation shows that with appropriate pre-processing techniques based on Vietnamese social media characteristics, Multinomial Logistic Regression achieves the best F1-score of 64.40%, a significant improvement over the CNN model built by the authors of UIT-VSMEC.

Monolingual vs multilingual BERTology for Vietnamese extractive multi-document summarization

A novel comparison between different multilingual and monolingual BERT models is introduced, and results indicate that monolingual models produce promising results compared to other multilingual models and previous text summarizing models for Vietnamese.

A simple and efficient ensemble classifier combining multiple neural network models on social media datasets in Vietnamese

This study aims to classify Vietnamese texts on social media from three different Vietnamese benchmark datasets using CNN, LSTM, and their variants, and proposes an ensemble model that combines the highest-performing models.

CamemBERT: a Tasty French Language Model

This paper investigates the feasibility of training monolingual Transformer-based language models for other languages, taking French as an example and evaluating their language models on part-of-speech tagging, dependency parsing, named entity recognition and natural language inference tasks.

VLSP 2021 - ViMRC Challenge: Vietnamese Machine Reading Comprehension

This article presents details of the organization of the shared task, an overview of the methods employed by shared-task participants, and the results. The authors believe that releasing the UIT-ViQuAD 2.0 dataset motivates more researchers to improve Vietnamese machine reading comprehension.

Emotion Recognition for Vietnamese Social Media Text

A standard Vietnamese Social Media Emotion Corpus (UIT-VSMEC) with exactly 6,927 emotion-annotated sentences is built, contributing to emotion recognition research in Vietnamese, which is a low-resource language in natural language processing (NLP).

MultiLexNorm: A Shared Task on Multilingual Lexical Normalization

The MULTILEXNORM shared task provides the largest publicly available multilingual lexical normalization benchmark including 12 language variants and proposes a homogenized evaluation setup with both intrinsic and extrinsic evaluation.

A Large-scale Dataset for Hate Speech Detection on Vietnamese Social Media Texts

ViHSD, a human-annotated dataset for automatically detecting hate speech on social networks, is introduced, along with the data creation process for annotating the dataset and evaluating its quality.

PhoBERT: Pre-trained language models for Vietnamese

Experimental results show that PhoBERT consistently outperforms the recent best pre-trained multilingual model XLM-R and improves the state-of-the-art in multiple Vietnamese-specific NLP tasks including Part-of-speech tagging, Dependency parsing, Named-entity recognition and Natural language inference.