Challenges and Strategies in Cross-Cultural NLP

@inproceedings{Hershcovich2022ChallengesAS,
  title={Challenges and Strategies in Cross-Cultural NLP},
  author={Daniel Hershcovich and Stella Frank and Heather Christine Lent and Miryam de Lhoneux and Mostafa Abdou and Stephanie Brandl and Emanuele Bugliarello and Laura Cabello Piqueras and Ilias Chalkidis and Ruixiang Cui and Constanza Fierro and Katerina Margatina and Phillip Rust and Anders S{\o}gaard},
  booktitle={ACL},
  year={2022}
}
Various efforts in the Natural Language Processing (NLP) community have been made to accommodate linguistic diversity and serve speakers of many different languages. However, it is important to acknowledge that speakers, and the content they produce and require, vary not only by language but also by culture. Although language and culture are tightly linked, there are important differences. Analogous to cross-lingual and multilingual NLP, cross-cultural and multicultural NLP considers these…

Figures from this paper

Citations

Crossing the Conversational Chasm: A Primer on Natural Language Processing for Multilingual Task-Oriented Dialogue Systems
TLDR
This work provides an extensive overview of existing methods and resources in multilingual task-oriented dialogue (ToD) as an entry point to this exciting and emerging technology, and draws parallels between components of the ToD pipeline and other NLP tasks, which can inspire solutions for learning in low-resource scenarios.
Compositional Generalization in Multilingual Semantic Parsing over Wikidata
TLDR
This work proposes a method for creating a multilingual, parallel dataset of question-query pairs grounded in Wikidata, introduces such a dataset, and uses it to analyze the compositional generalization of semantic parsers in Hebrew, Kannada, Chinese, and English.
Characteristics of Harmful Text: Towards Rigorous Benchmarking of Language Models
TLDR
Six ways of characterizing harmful text which merit explicit consideration when designing new benchmarks are outlined and applied in a case study of the Perspective API, a toxicity classifier that is widely used in harm benchmarks.
Vygotskian Autotelic Artificial Intelligence: Language and Culture Internalization for Human-Like AI
Building autonomous artificial agents able to grow open-ended repertoires of skills across their lives is one of the fundamental goals of AI. To that end, a promising developmental approach recommends…
Managing corporate language diversity
Organisations that have offices in more than one country face the challenges of managing access to information created and circulated in more than one language. Research into the challenges and…

References

Showing 1–10 of 173 references
The State and Fate of Linguistic Diversity and Inclusion in the NLP World
TLDR
This work examines the relation between language types, resources, and their representation at NLP conferences to understand the trajectory that different languages have followed over time, underlining the disparity between languages.
Cross-Cultural Similarity Features for Cross-Lingual Transfer Learning of Pragmatically Motivated Tasks
TLDR
The authors' analyses show that the proposed pragmatic features do capture cross-cultural similarities and align well with existing work in sociolinguistics and linguistic anthropology, corroborating the effectiveness of pragmatically driven transfer in the downstream task of choosing transfer languages for cross-lingual sentiment analysis.
Natural language processing for similar languages, varieties, and dialects: A survey
TLDR
The most important challenges when dealing with diatopic language variation are discussed, and some of the available datasets, the process of data collection, and the most common data collection strategies used to compile datasets for similar languages, varieties, and dialects are presented.
XNLI: Evaluating Cross-lingual Sentence Representations
TLDR
This work constructs an evaluation set for XLU by extending the development and test sets of the Multi-Genre Natural Language Inference Corpus to 14 languages, including low-resource languages such as Swahili and Urdu, and finds that XNLI represents a practical and challenging evaluation suite and that directly translating the test data yields the best performance among available baselines.
Mining Cross-Cultural Differences and Similarities in Social Media
TLDR
A lightweight yet effective approach to mining cross-cultural differences of named entities and finding similar terms for slang across languages that could be useful for machine translation applications and research in computational social science is presented.
XGLUE: A New Benchmark Dataset for Cross-lingual Pre-training, Understanding and Generation
TLDR
A recent cross-lingual pre-trained model Unicoder is extended to cover both understanding and generation tasks, which is evaluated on XGLUE as a strong baseline and the base versions of Multilingual BERT, XLM and XLM-R are evaluated for comparison.
Cross-Cultural Transfer Learning for Text Classification
TLDR
It is shown that cross-cultural differences can be harnessed for natural language text classification, and a transfer-learning framework is presented that leverages widely-available unaligned bilingual corpora for classification tasks, using no task-specific data.
Towards cross-lingual distributed representations without parallel text trained with adversarial autoencoders
TLDR
A scheme to map word vectors trained on a source language to vectors semantically compatible with word vectors trained on a target language using an adversarial autoencoder is proposed, and preliminary qualitative results are presented.
MasakhaNER: Named Entity Recognition for African Languages
TLDR
This work brings together different stakeholders to create the first large, publicly available, high-quality dataset for named entity recognition (NER) in ten African languages and details the characteristics of these languages to help researchers and practitioners better understand the challenges they pose for NER tasks.
Massively Multilingual Word Embeddings
TLDR
New methods for estimating and evaluating embeddings of words in more than fifty languages in a single shared embedding space are introduced and a new evaluation method is shown to correlate better than previous ones with two downstream tasks.