Tracking Knowledge Propagation Across Wikipedia Languages

Roldolfo Valentim, Giovanni V. Comarela, Souneil Park, Diego Sáez-Trumper
In this paper, we present a dataset of inter-language knowledge propagation in Wikipedia. Covering all 309 language editions and 33M articles, the dataset aims to track the full propagation history of Wikipedia concepts and to enable follow-up research on predictive models of that propagation. For this purpose, we align all Wikipedia articles in a language-agnostic manner according to the concept they cover, which results in 13M propagation instances. To the best of our knowledge, this…
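As a rough illustration of the alignment idea in the abstract, the sketch below groups article records by a shared concept identifier and orders each group by creation date, giving the sequence of languages in which a concept appeared. It is not the authors' pipeline: the record layout, the `Q`-style concept identifiers, the function name `propagation_histories`, and the sample data are all assumptions made for the example.

```python
from collections import defaultdict
from datetime import date


def propagation_histories(articles):
    """Group (concept_id, language, created) records by concept and
    return, per concept, the languages ordered by article creation date."""
    by_concept = defaultdict(list)
    for concept_id, language, created in articles:
        by_concept[concept_id].append((created, language))
    return {
        concept_id: [language for _, language in sorted(entries)]
        for concept_id, entries in by_concept.items()
    }


# Hypothetical sample records: (concept id, language edition, creation date).
articles = [
    ("Q1", "en", date(2004, 3, 1)),
    ("Q1", "pt", date(2006, 7, 15)),
    ("Q1", "es", date(2005, 1, 9)),
    ("Q2", "de", date(2010, 5, 2)),
]

histories = propagation_histories(articles)
```

Keying on a language-independent identifier rather than on titles means that articles about the same concept fall into one group regardless of how each edition names it.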

TWikiL - The Twitter Wikipedia Link Dataset

Recent research has shown how strongly connected Wikipedia and other web applications are. For example, search engines rely heavily on surfacing Wikipedia links to satisfy their users’ information

Cross-Lingual GenQA: Open-Domain Question Answering with Answer Sentence Generation

This paper introduces GEN-TYDIQA, an extension of the TyDiQA dataset with well-formed and complete answers for Arabic, Bengali, English, Japanese, and Russian questions, and presents the first cross-lingual answer sentence generation system (CROSS-LINGUAL GENQA).

Cross-Lingual GenQA: A Language-Agnostic Generative Question Answering Approach for Open-Domain Question Answering

This paper presents the first generalization of the GENQA approach for the multilingual environment, and presents the GEN-TYDIQA dataset, which extends the TyDiQA evaluation data with natural-sounding, well-formed answers in Arabic, Bengali, English, Japanese, and Russian.



Growing Wikipedia Across Languages via Recommendation

This paper presents an end-to-end system for recommending articles for creation that exist in one language but are missing in another, and finds that personalizing recommendations increases editor engagement by a factor of two and that articles created as a result of these recommendations are of comparable quality to organically created articles.

Interactions of Cultures and Top People of Wikipedia from Ranking of 24 Language Editions

Considering historical figures who appear in multiple editions as interactions between cultures, a network of cultures is constructed and the most influential cultures are identified according to this network.

The Evolution of Wikipedia

It is proposed that not only the degree of the destination node but also its PageRank score can be used to explain the preferential generative process of graph edges, and the effectiveness of PageRank as a predictor of edge destination is evaluated.

Information arbitrage across multi-lingual Wikipedia

Analyzing four large language domains (English, Spanish, French, and German), this work presents Ziggurat, an automated system for aligning Wikipedia infoboxes, creating new infoboxes as necessary, filling in missing information, and detecting discrepancies between parallel pages.

Understanding Editing Behaviors in Multilingual Wikipedia

Evidence is found for a complexity barrier whereby editors are less likely to edit complex content in a second language and multilinguals are less engaged and show lower levels of language proficiency in their second languages.

wikiBABEL: community creation of multilingual data

This paper describes the architectural components implementing the wikiBABEL framework and discusses the integrated linguistic resources and tools, such as bilingual dictionaries and machine translation and transliteration systems, that help users during the content correction and creation process.

Will this Idea Spread Beyond Academia? Understanding Knowledge Transfer of Scientific Concepts across Text Corpora

This work extracts scientific concepts from corpora as instantiations of “research ideas”, creates concept-level features as motivated by literature, and follows the trajectories of over 450,000 new concepts to identify factors that lead only a small proportion of these ideas to be used in inventions and drug trials.

Multilinguals and Wikipedia editing

This study finds that multilingual users are much more active than their single-edition (monolingual) counterparts and are found in all language editions, but smaller-sized editions with fewer users have a higher percentage of multilingual users than larger-sized editions.

Linguistic neighbourhoods: explaining cultural borders on Wikipedia through multilingual co-editing activity

This study sheds light on how culture is reflected in the collective process of archiving knowledge on Wikipedia, and demonstrates that cross-lingual interconnections on Wikipedia are not dominated by one powerful language.

The tower of Babel meets web 2.0: user-generated content and its applications in a multilingual context

This study explores language's fragmenting effect on user-generated content by examining the diversity of knowledge representations across 25 different Wikipedia language editions and demonstrates that the diversity present is greater than has been presumed in the literature and has a significant influence on applications that use Wikipedia as a source of world knowledge.