Excavating the mother lode of human-generated text: A systematic review of research that uses the Wikipedia corpus

@article{Mehdi2017ExcavatingTM,
  title={Excavating the mother lode of human-generated text: A systematic review of research that uses the {W}ikipedia corpus},
  author={Mohamad Mehdi and Chitu Okoli and Mostafa Mesgari and Finn {\AA}rup Nielsen and Arto Lanam{\"a}ki},
  journal={Inf. Process. Manag.},
  year={2017},
  volume={53},
  pages={505--529}
}
Testing the validity of Wikipedia categories for subject matter labelling of open-domain corpus data
TLDR: The results suggest that at least the taxonomy derived from the Wikipedia category system is not a valid instrument for manual subject matter labelling of open-domain text corpora.
Use of Wikipedia categories on information retrieval research: a brief review
TLDR: This paper adopts a systematic literature review approach in order to identify different approaches to and uses of Wikipedia categories in information retrieval research, and shows that in many cases the research approaches applied and the results obtained can be integrated into a comprehensive and inclusive concept of information retrieval.
Exploring the Domain of Information “Users”: Semantic Analysis of Wikipedia Articles
TLDR: The findings reveal that Wikipedia covers various topics of the information users domain, ranging from information search behavior, information retrieval, human-computer interaction, user experience, and human factors to other topics.
Computing controversy: Formal model and algorithms for detecting controversy on Wikipedia and in search queries
TLDR: A formal model of controversy is introduced as the basis of computational approaches to detecting controversial concepts, and a classification-based method for automatic detection of controversial articles and categories in Wikipedia is proposed.
Open semantic analysis: The case of word level semantics in Danish
TLDR: Data-driven models for Danish semantic relatedness, word intrusion and sentiment prediction are described, and it is found that logistic regression and large random forests perform well with semantic representations.
Evaluation of Naive Bayes and Support Vector Machines for Wikipedia
TLDR: This work compares and illustrates the effectiveness of two standard classifiers from the text classification literature, Naive Bayes and Support Vector Machines, on the full English Wikipedia corpus for six different categories, and shows that SVM (linear kernel) performs exceptionally well across all categories.
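Both classifiers in that comparison are textbook methods. As a sketch of how the Naive Bayes side of such an evaluation works, the following is a minimal multinomial Naive Bayes text classifier with add-one smoothing, trained on a made-up toy corpus — the category names and documents are illustrative, not the paper's data:

```python
import math
from collections import Counter, defaultdict

# Tiny hypothetical training set standing in for labelled Wikipedia
# articles (two stand-in categories; the paper uses six).
train = [
    ("sports", "football match goal team score"),
    ("sports", "tennis player serve match win"),
    ("science", "physics experiment theory energy"),
    ("science", "chemistry molecule reaction energy experiment"),
]

class NaiveBayes:
    """Multinomial Naive Bayes with Laplace (add-one) smoothing."""

    def fit(self, docs):
        self.word_counts = defaultdict(Counter)  # per-class word frequencies
        self.class_counts = Counter()            # class priors (document counts)
        self.vocab = set()
        for label, text in docs:
            words = text.split()
            self.class_counts[label] += 1
            self.word_counts[label].update(words)
            self.vocab.update(words)
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best, best_lp = None, -math.inf
        for label in self.class_counts:
            # log prior + smoothed log likelihood of each token
            lp = math.log(self.class_counts[label] / total_docs)
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for w in text.split():
                lp += math.log((self.word_counts[label][w] + 1) / denom)
            if lp > best_lp:
                best, best_lp = label, lp
        return best

clf = NaiveBayes().fit(train)
```

A real replication would swap the toy corpus for bag-of-words vectors over Wikipedia articles and add a linear-kernel SVM as the second system.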

References

Showing 1-10 of 150 references
Mining Meaning from Wikipedia
A knowledge-based search engine powered by Wikipedia
TLDR: Koru is a new search interface that offers effective domain-independent knowledge-based information retrieval; it exhibits an understanding of the topics of both queries and documents, and is capable of lending assistance to almost every query issued to it.
Wikipedia-based Semantic Interpretation for Natural Language Processing
TLDR: This work proposes a novel method, called Explicit Semantic Analysis (ESA), for fine-grained semantic interpretation of unrestricted natural language texts, which represents meaning in a high-dimensional space of concepts derived from Wikipedia, the largest encyclopedia in existence.
Extracting Lexical Semantic Knowledge from Wikipedia and Wiktionary
TLDR: This paper presents two application programming interfaces for Wikipedia and Wiktionary which are especially designed for mining the rich lexical semantic information dispersed in the knowledge bases, and provide efficient and structured access to the available knowledge.
Learning for information extraction: from named entity recognition and disambiguation to relation extraction
TLDR: This research uses Wikipedia as a repository of named entities and proposes a ranking approach to disambiguation that exploits learned correlations between words from the name context and categories from the Wikipedia taxonomy.
Wikitology: a novel hybrid knowledge base derived from wikipedia
TLDR: The value of the derived knowledge base is demonstrated by developing problem-specific intelligent approaches that exploit Wikitology for a diverse set of use cases, namely document concept prediction, cross-document co-reference resolution, Entity Linking to KB entities defined as part of the Text Analysis Conference - Knowledge Base Population Track 2009, and interpreting tables.
Expert-Built and Collaboratively Constructed Lexical Semantic Resources
TLDR: A comprehensive overview of the lexical semantic knowledge in these resources is provided, and work on orchestrating different resources in order to combine their strengths and explore their use in major NLP applications is reviewed.
Learning to link with Wikipedia
TLDR: This paper explains how machine learning can be used to identify significant terms within unstructured text and enrich it with links to the appropriate Wikipedia articles; the approach performs very well, with recall and precision of almost 75%.
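Wikification systems of this kind typically start from link statistics mined from Wikipedia's own inter-article links. The sketch below illustrates two such priors — "link probability" (how often a phrase is linked rather than merely mentioned) and "commonness" (the most frequent target for an anchor text) — which learned systems then refine with context features. The anchor dictionary and all counts are invented for illustration and are not from the paper:

```python
# Hypothetical anchor statistics: anchor text -> {target article: link count}.
# In a real system these are mined from Wikipedia inter-article links.
anchors = {
    "java":   {"Java (programming language)": 80, "Java (island)": 20},
    "python": {"Python (programming language)": 70, "Python (snake)": 30},
    "the":    {"The": 1},
}
# Total occurrences of each phrase in the corpus, linked or not.
occurrences = {"java": 120, "python": 110, "the": 100000}

def wikify(text, min_link_prob=0.5):
    """Link phrases whose link probability passes the threshold,
    disambiguating each to its most common target (the commonness prior)."""
    links = {}
    for word in text.lower().split():
        if word in anchors:
            total_links = sum(anchors[word].values())
            # Link probability: fraction of occurrences that are links.
            if total_links / occurrences[word] >= min_link_prob:
                links[word] = max(anchors[word], key=anchors[word].get)
    return links
```

Here "the" is never linked because its link probability is negligible, while "java" and "python" clear the threshold and resolve to their most common targets.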
Using Wikipedia knowledge to improve text classification
TLDR: Experimental results on several data sets show that the proposed approach, integrated with the thesaurus built from Wikipedia, can achieve significant improvements with respect to the baseline algorithm.
Computing Semantic Relatedness Using Wikipedia-based Explicit Semantic Analysis
TLDR: This work proposes Explicit Semantic Analysis (ESA), a novel method that represents the meaning of texts in a high-dimensional space of concepts derived from Wikipedia, and that results in substantial improvements in correlation of computed relatedness scores with human judgments.
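ESA represents a text as a weighted vector over Wikipedia concepts (one dimension per article, weighted by the tf-idf mass the article shares with the text) and scores relatedness as the cosine of two such vectors. A minimal sketch of that idea, using a three-article toy "encyclopedia" in place of Wikipedia — all concept names and texts are made up:

```python
import math
from collections import Counter

# Toy stand-in for Wikipedia: concept name -> article text.
concepts = {
    "Computer": "computer program machine software hardware code",
    "Music":    "music song melody instrument concert sound",
    "Biology":  "cell organism gene biology life species",
}

def tf(text):
    """Term frequencies of a whitespace-tokenized text."""
    words = text.lower().split()
    return {w: c / len(words) for w, c in Counter(words).items()}

def idf(word):
    """Smoothed inverse document frequency over the concept corpus."""
    df = sum(1 for doc in concepts.values() if word in doc.split())
    return math.log((1 + len(concepts)) / (1 + df)) + 1

def esa_vector(text):
    """Map a text to a vector over concepts: each concept is weighted
    by the tf-idf mass its article shares with the text."""
    weights = tf(text)
    vec = {}
    for name, doc in concepts.items():
        doc_tf = tf(doc)
        vec[name] = sum(w * doc_tf.get(word, 0) * idf(word)
                        for word, w in weights.items())
    return vec

def cosine(u, v):
    dot = sum(u[k] * v[k] for k in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Two texts about computing should score as more related to each other
# than either is to a text about music.
a = esa_vector("software code and a computer program")
b = esa_vector("hardware and machine code")
c = esa_vector("a concert with song and melody")
```

The full method does the same thing with hundreds of thousands of concept dimensions and an inverted index for efficiency; the structure of the computation is unchanged.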