Considerations for Multilingual Wikipedia Research

Isaac Johnson and Emily A. Lescak
English Wikipedia has long been an important data source for much research and natural language machine learning modeling. The growth of non-English language editions of Wikipedia, greater computational resources, and calls for equity in the performance of language and multimodal models have led to the inclusion of many more language editions of Wikipedia in datasets and models. Building better multilingual and multimodal models requires more than just access to expanded datasets; it also… 



Understanding Editing Behaviors in Multilingual Wikipedia

The study finds evidence of a complexity barrier: editors are less likely to edit complex content in a second language, and multilingual editors are less engaged and show lower levels of language proficiency in their second languages.

Relative Quality and Popularity Evaluation of Multilingual Wikipedia Articles

The proposed method identifies higher-quality articles that can be used to automatically enrich other language editions of Wikipedia. The study also investigates the correlation between the quality and popularity of Wikipedia articles on selected topics in various languages.

Wikipedia Beyond the English Language Edition

The findings show that the same power plays used in the English (EN) edition also occur in the Persian (FA) and Chinese (ZH) editions, but the frequency of their use differs across editions, suggesting that editors in different language communities value contrasting types of policies when competing for power while discussing and editing articles.

Do We All Talk Before We Type?: Understanding Collaboration in Wikipedia Language Editions

This study leverages an influential collaboration model based on behaviors in the English Wikipedia as a lens to consider collaborative activity in the Spanish and French language editions, and demonstrates the need to account for variations in collaborative behaviors in all language editions of Wikipedia.

Wiki-40B: Multilingual Language Model Dataset

A new multilingual language model benchmark that is composed of 40+ languages spanning several scripts and linguistic families with around 40 billion characters is proposed, and the task of multilingual causal language modeling is introduced.

Omnipedia: Bridging the Wikipedia Language Gap

A study of Omnipedia that characterizes how people interact with information using a multilingual lens found that users actively sought information exclusive to unfamiliar language editions and strategically compared how language editions defined concepts.

The_Tower_of_Babel.jpg: Diversity of Visual Encyclopedic Knowledge Across Wikipedia Language Editions

It is found that cross-language image diversity rivals, and often exceeds, that found in text, and that many images are unique to different language editions.

In search of the ur-Wikipedia: universality, similarity, and translation in the Wikipedia inter-language link network

The number of articles in a Wikipedia edition is found to be the strongest predictor of similarity, while language similarity also appears to have an influence.

Why the World Reads Wikipedia: Beyond English Speakers

A large-scale survey of Wikipedia readers across 14 language editions, combined with a log-based analysis of user activity, advances understanding of reader motivations and behaviors across Wikipedia languages and has implications for Wikipedia editors and for developers of Wikipedia and other Web technologies.

Language-agnostic Topic Classification for Wikipedia

A language-agnostic approach is proposed that uses the links in an article to classify it into a taxonomy of topics and can be easily applied to (almost) any language and article on Wikipedia. It is shown to match the performance of a language-dependent approach while being simpler and having much greater coverage.