Bleaching Text: Abstract Features for Cross-lingual Gender Prediction

@inproceedings{Goot2018BleachingTA,
  title={Bleaching Text: Abstract Features for Cross-lingual Gender Prediction},
  author={Rob van der Goot and Nikola Ljubesic and Ian Matroos and M. Nissim and Barbara Plank},
  booktitle={ACL},
  year={2018}
}
Gender prediction has typically focused on lexical and social network features, yielding good performance, but making systems highly language-, topic-, and platform-dependent. Cross-lingual embeddings circumvent some of these limitations, but capture gender-specific style less. We propose an alternative: bleaching text, i.e., transforming lexical strings into more abstract features. This study provides evidence that such features allow for better transfer across languages. Moreover, we present…
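The idea of bleaching, i.e., replacing each lexical string with a more abstract description of it, can be illustrated with a minimal Python sketch. The three abstractions used here (character-class shape, token length, and vowel/consonant pattern) and the function names `shape`, `vowel_pattern`, and `bleach` are assumptions chosen for illustration; the abstract does not spell out the feature set, so this should not be read as the authors' implementation.

```python
import re
from typing import List

# Assumption: only ASCII vowels are treated as vowels in this sketch.
VOWELS = set("aeiouAEIOU")


def shape(token: str) -> str:
    """Map characters to classes (U/L/D/X) and collapse repeated classes."""
    classes = []
    for ch in token:
        if ch.isupper():
            classes.append("U")   # uppercase letter
        elif ch.islower():
            classes.append("L")   # lowercase letter
        elif ch.isdigit():
            classes.append("D")   # digit
        else:
            classes.append("X")   # punctuation, emoji, anything else
    # Collapse runs such as "ULLLL" into "UL".
    return re.sub(r"(.)\1+", r"\1", "".join(classes))


def vowel_pattern(token: str) -> str:
    """Replace each character with V (vowel), C (other letter), or O (other)."""
    out = []
    for ch in token:
        if ch in VOWELS:
            out.append("V")
        elif ch.isalpha():
            out.append("C")
        else:
            out.append("O")
    return "".join(out)


def bleach(text: str) -> List[str]:
    """Turn whitespace-separated tokens into abstract, word-free features."""
    return [
        f"{shape(tok)}_{len(tok):02d}_{vowel_pattern(tok)}"
        for tok in text.split()
    ]


if __name__ == "__main__":
    print(bleach("Hello world 2018!"))
```

Because the output contains no actual words, the same representation can be computed for text in any language that uses the Latin alphabet, which is the property that makes transfer across languages plausible.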

Gender prediction using lexical, morphological, syntactic and character-based features in Dutch
TLDR
This work provides a comparison of different features across genres in two types of tasks and presents two pipelines, finding that lexical features are more significant, although other features also show good results, making the model more robust.
Investigating cross-lingual training for offensive language detection
TLDR
It is shown that using a better pre-trained language model results in a large gain in overall performance and in zero-shot transfer, and that intermediate training on other languages is effective when little target-language data is available.
Overview of the EVALITA 2018 Cross-Genre Gender Prediction (GxG) Task
TLDR
Results from a total of 50 different runs show that the Gender Cross-Genre task is difficult to learn in itself: while almost all runs beat a 50% baseline, no model reaches an accuracy above 70%.
Simple n-gram based models perform well for gender prediction. Sometimes
In this paper we describe our participation in the Evalita 2018 GxG cross-genre/domain gender prediction shared task for Italian. Building on previous results obtained on in-genre gender prediction,…
Gender Prediction from Tweets: Improving Neural Representations with Hand-Crafted Features
TLDR
An RNN model with attention (RNNwA) is proposed to predict the gender of a Twitter user from their tweets; it achieves state-of-the-art performance on English and competitive results on Spanish and Arabic.
Sentence-Level BERT and Multi-Task Learning of Age and Gender in Social Media
TLDR
This work exploits a newly-created Arabic dataset with ground truth age and gender labels to learn these attributes both individually and in a multi-task setting at the sentence level, and builds models with gated recurrent units and bidirectional encoder representations from transformers (BERT).
TAG-it @ EVALITA2020: Overview of the Topic, Age, and Gender Prediction Task for Italian
TLDR
It is observed that topic and gender are easier to predict than age, which might be due to the larger evidence per author provided at this edition, as well as to the availability of pre-trained large models for fine-tuning, which have shown improvement on very many NLP tasks.
Exploring Combining Training Datasets for the CLIN 2019 Shared Task on Cross-genre Gender Detection in Dutch
  • G. Bouma
  • Psychology, Computer Science
  • GxG@CLIN
  • 2019
TLDR
This work starts from a simple logistic regression model with commonly used features, and considers two ways of combining training data from different sources, which do reasonably well, but the cross-genre models are a lot worse.
Cross-domain and Cross-lingual Abusive Language Detection: A Hybrid Approach with Deep Learning and a Multilingual Lexicon
TLDR
It is shown that training a system on general abusive language datasets will produce a cross-domain robust system, which can be used to detect other more specific types of abusive content, and found that using the domain-independent lexicon HurtLex is useful to transfer knowledge between domains and languages.
An Analysis of Gender and Bias Throughout the Machine Learning Lifecycle
Correctly resolving textual mentions of people fundamentally entails making inferences about those people. Such inferences raise the risk of systematic biases in coreference resolution systems,…

References

SHOWING 1-10 OF 36 REFERENCES
Language-independent Gender Prediction on Twitter
TLDR
The classification results show that, while the prediction model based on language-independent features performs worse than the bag-of-words model when training and testing on the same language, it regularly outperforms the bag-of-words model when applied to different languages, showing very stable results across various languages.
N-GrAM: New Groningen Author-profiling Model
TLDR
The aim was to create a single model for both gender and language, and for all language varieties, which is a linear support vector machine (SVM) with word unigrams and character 3- to 5-grams as features.
A Survey of Cross-lingual Word Embedding Models
TLDR
A comprehensive typology of cross-lingual word embedding models is provided, showing that many of the models presented in the literature optimize for the same objectives, and that seemingly different models are often equivalent modulo optimization strategies, hyper-parameters, and such.
Motivating Personality-aware Machine Translation
TLDR
It is shown that both translation of the source training data into the target language and translation of the target test data into the source language have a detrimental effect on the accuracy of predicting author traits, which supports the need for personal and personality-aware machine translation models.
A survey of cross-lingual embedding models
TLDR
This work surveys models that seek to learn cross-lingual embeddings and discusses them based on the type of approach and the nature of parallel data that they employ.
Gender identity and lexical variation in social media
TLDR
Pairing computational methods and social theory offers a new perspective on how gender emerges as individuals position themselves relative to audiences, topics, and mainstream gender norms.
Gender Attribution: Tracing Stylometric Evidence Beyond Topic and Genre
TLDR
This work is the first that consciously avoids gender bias in topics, thereby providing stronger evidence for gender-specific styles in language beyond topic, and the comparative study provides new insights into the robustness of various stylometric techniques across topic and genre.
Why Gender and Age Prediction from Tweets is Hard: Lessons from a Crowdsourcing Experiment
TLDR
In this paper, insights from sociolinguistics and data collected through an online game are combined to underline the importance of approaching age and gender as social variables rather than static biological variables.
Multilingual Part-of-Speech Tagging with Bidirectional Long Short-Term Memory Models and Auxiliary Loss
TLDR
This work presents a novel bi-LSTM model that combines the POS tagging loss function with an auxiliary loss function accounting for rare words; the model obtains state-of-the-art performance across 22 languages and works especially well for morphologically complex languages.
Simple Queries as Distant Labels for Predicting Gender on Twitter
TLDR
This paper demonstrates the effectiveness of gathering distant labels for self-reported gender on Twitter using simple queries and offers a cheap, extensible, and fast alternative that can be employed beyond the task of gender classification.