YOSM: A New Yoruba Sentiment Corpus for Movie Reviews

Authors: Iyanuoluwa Shode, David Ifeoluwa Adelani, Anna Feldman

Sentiment analysis is a popular text classification task in natural language processing. It involves developing algorithms or machine learning models to determine the sentiment or opinion expressed in a piece of text. Business owners and product developers can use the results to understand how consumers perceive their products. Aside from customer feedback and product or service analysis, the task is also useful for social media monitoring (Martin et al., 2021). One of…

This work considered two datasets, both drawn predominantly from IMDB and consisting only of textual content, which was cleaned by removing unnecessary material and divided into two categories: positive and negative.

Sentiment Analysis on Movie Reviews

This project explored the use of various supervised machine learning algorithms for learning sentiment classifiers and tested how effectively different feature selection algorithms improve those classifiers.

Sentiment Analysis on Urdu Tweets Using Markov Chains

A sentiment analysis approach based on Markov chains for predicting the sentiment of Urdu tweets outperforms lexicon-based and traditional machine learning-based approaches to sentiment analysis.

Exploring Amharic Sentiment Analysis from Social Media Texts: Building Annotation Tools and Classification Models

The challenges in building a sentiment analysis system for Amharic are investigated, and the widespread use of sarcasm and figurative speech is found to be the main issue in dealing with the problem.

Sentiment Classification in Swahili Language Using Multilingual BERT

This study performs sentiment classification on Swahili datasets using the current state-of-the-art model, multilingual BERT, and achieves a best accuracy of 87.59%.

NaijaSenti: A Nigerian Twitter Sentiment Corpus for Multilingual Sentiment Analysis

The first large-scale human-annotated Twitter sentiment dataset for Nigeria—Hausa, Igbo, Nigerian-Pidgin, and Yorùbá—consisting of around 30,000 annotated tweets per language is introduced, including a significant fraction of code-mixed tweets.

The Effect of Domain and Diacritics in Yoruba–English Neural Machine Translation

This paper presents MENYO-20k, the first multi-domain parallel corpus with a special focus on clean orthography for Yorùbá–English with standardized train-test splits for benchmarking and investigates how and when this training condition affects the final quality and intelligibility of a translation.

Massive vs. Curated Embeddings for Low-Resourced Languages: the Case of Yorùbá and Twi

This paper focuses on two African languages, Yorùbá and Twi, and uses different architectures that learn word representations from both surface forms and characters to further exploit all the available information, which proved to be important for these languages.

Small Data? No Problem! Exploring the Viability of Pretrained Multilingual Language Models for Low-resourced Languages

It is shown that competitive multilingual language models can be trained on less than 1 GB of text, and the results suggest that the “small data” approach based on similar languages may sometimes work better than joint training on large datasets with high-resource languages.

MasakhaNER: Named Entity Recognition for African Languages

This work brings together different stakeholders to create the first large, publicly available, high-quality dataset for named entity recognition (NER) in ten African languages and details the characteristics of these languages to help researchers and practitioners better understand the challenges they pose for NER tasks.