Language-agnostic Topic Classification for Wikipedia

Isaac Johnson, Martin Gerlach, Diego Sáez-Trumper
Companion Proceedings of the Web Conference 2021
A major challenge for many analyses of Wikipedia dynamics (e.g., imbalances in content quality, geographic differences in what content is popular, what types of articles attract more editor discussion) is grouping the very diverse range of Wikipedia articles into coherent, consistent topics. This problem has been addressed using various approaches based on Wikipedia’s category network, WikiProjects, and external taxonomies. However, these approaches have always been limited in their coverage…
2 Citations

A Map of Science in Wikipedia
This work relies on an open dataset of citations from Wikipedia, and uses network analysis to map the relationship between Wikipedia articles and scientific journal articles, and finds that most journal articles cited from Wikipedia belong to STEM fields, in particular biology and medicine.
The Unreasonable Effectiveness of Machine Learning in Moldavian versus Romanian Dialect Identification
This work presents a subjective evaluation by human annotators, showing that humans attain much lower accuracy rates than machine learning (ML) models, and experiments showing that ML performance on the MRC shared task can be improved through an ensemble based on classifier stacking.


What is Trending on Wikipedia? Capturing Trends and Language Biases Across Wikipedia Editions
An automatic evaluation and comparison of the browsing behavior of Wikipedia readers that can be applied to any language edition of Wikipedia, focusing on the English, French, and Russian editions during the last four months of 2018, shows that people share a common interest and curiosity for entertainment independently of their language.
Relative Quality and Popularity Evaluation of Multilingual Wikipedia Articles
The proposed method finds higher-quality articles that can be used to automatically enrich other language editions of Wikipedia; the correlation between quality and popularity of Wikipedia articles on selected topics in various languages is also investigated.
Scalable Recommendation of Wikipedia Articles to Editors Using Representation Learning
A scalable system built on Graph Convolutional Networks and Doc2Vec is developed that learns how to represent Wikipedia articles and deliver personalized recommendations to editors; it outperforms competitive implicit-feedback collaborative-filtering methods such as WMRF based on ALS.
Structuring Wikipedia Articles with Section Recommendations
This paper defines the problem of section recommendation for Wikipedia articles and proposes several approaches for tackling it, concluding that the category-based approach works best, achieving precision@10 of about 80% in the human evaluation.
Why We Read Wikipedia
These findings advance the understanding of reader motivations and behavior on Wikipedia and can have implications for developers aiming to improve Wikipedia's user experience, editors striving to cater to their readers' needs, third-party services providing access to Wikipedia content, and researchers aiming to build tools such as recommendation engines.
Polylingual Topic Models
This work introduces a polylingual topic model that discovers topics aligned across multiple languages and demonstrates its usefulness in supporting machine translation and tracking topic trends across languages.
WikiHist.html: English Wikipedia's Full Revision History in HTML Format
The advantages of WikiHist.html over raw wikitext are highlighted in an empirical analysis of Wikipedia's hyperlinks, showing that over half of the wiki links present in HTML are missing from raw wikitext and that the missing links are important for user navigation.
With Few Eyes, All Hoaxes are Deep
An effective automated topic model is demonstrated, based on a labeling strategy that leverages a folksonomy developed by subject-specific working groups in Wikipedia and a flexible ontology to arrive at a hierarchical and uniform label set.
Applying a Multi-Level Modeling Theory to Assess Taxonomic Hierarchies in Wikidata
This paper uses an axiomatic theory for multi-level modeling to analyze current Wikidata content, and identifies a significant number of problematic classification and taxonomic statements.
Identifying Semantic Edit Intentions from Revisions in Wikipedia
This work develops, in collaboration with Wikipedia editors, a 13-category taxonomy of the semantic intention behind edits in Wikipedia articles, and builds a computational classifier of intentions that achieves a micro-averaged F1 score of 0.621.