Enrichment of Information in Multilingual Wikipedia Based on Quality Analysis

Włodzimierz Lewoniewski
Although Wikipedia is one of the most popular sources of information in the world, it is often criticized for the poor quality of its content. In this online encyclopaedia, articles on the same topic can be created and edited independently in different languages. Some of these language versions can provide valuable information on specific topics. Wikipedia articles may include an infobox, which is used to collect and present a subset of important information about the article's subject. This study…

Measures for Quality Assessment of Articles and Infoboxes in Multilingual Wikipedia

This study presents and classifies measures that can be extracted from Wikipedia articles for automatic quality assessment in different languages, and also describes extraction methods for the various sources of these measures.

Application of SEO Metrics to Determine the Quality of Wikipedia Articles and Their Sources

General results of a Wikipedia analysis using metrics from the Toolbox SISTRIX are presented; the analysis extracted data from more than 30 million references in different language versions of Wikipedia and covered over 180 thousand of the most popular hosts.

Comparative Analysis of the Informativeness and Encyclopedic Style of the Popular Web Information Sources

This work proposes combining the logic-linguistic model and the universal dependency treebank to extract facts of various quality levels from texts, in order to show the most significant types of facts and the types of words that most affect the encyclopedic style of a text.

The Influence of Various Text Characteristics on the Readability and Content Informativeness

The study focuses on how readability and certain features of texts written for a global audience influence text quality assessment, and proposes some directions towards automatically predicting the readability of texts on the Web.

Building the Semantic Similarity Model for Social Network Data Streams

The logical-linguistic model uses semantic and grammatical features of words to obtain a sequence of semantically related text fragments from different actors of a social network, for use in the analysis of social network data streams.

Automatic Extraction of Synonymous Collocation Pairs from a Text Corpus

The experiment shows that the more frequently synonymous collocations occur in texts, the more closely related the topics of those texts may be; the precision of the synonymous collocation search in this experiment is close to the results of other studies similar to the authors'.

Data quality evaluation: a comparative analysis of company registers' open data in four European countries

Validation of open data published by company registers in four different European countries shows deficiencies in the published data and demonstrates the applicability of the proposed methodology for data quality evaluation.



Analysis of References Across Wikipedia Languages

An analysis of the use of common references in over 10 million articles in several Wikipedia language editions (English, German, French, Russian, Polish, Ukrainian, Belarusian) shows the use of similar sources and their number in language-sensitive topics.

Quality and Importance of Wikipedia Articles in Different Languages

This article aims to analyse the importance of the Wikipedia articles in different languages (English, French, Russian, Polish) and the impact of the importance on the quality of articles. Based on

Modelling the Quality of Attributes in Wikipedia Infoboxes

This paper analyzes the features and models that can be used to evaluate the quality of articles, providing a foundation for the relative quality assessment of infobox attributes, with the purpose of improving the quality of DBpedia.


Analysis of the discussion pages and other process-oriented pages within the Wikipedia project helps in understanding how high quality is maintained in a project where anyone may participate with no prior vetting.

Assessing Information Quality of a Community-Based Encyclopedia

This work proposes seven IQ metrics which can be evaluated automatically and tests the set on a representative sample of Wikipedia content, along with a number of statistical characterizations of Wikipedia articles, their content construction, process metadata and social context.

Experiments with Wikipedia Cross-Language Data Fusion

A software framework for fusing RDF datasets based on different conflict resolution strategies is presented, and the framework is applied to fuse infobox data extracted from the English, German, Italian and French editions of Wikipedia.

Size matters: word count as a measure of quality on Wikipedia

A simple metric -- word count -- is proposed for measuring article quality and it is shown that this metric significantly outperforms the more complex methods described in related work.

Identifying featured articles in Wikipedia: writing style matters

A machine learning approach is presented that exploits an article's character trigram distribution and targets writing style rather than evaluating meta features like the edit history; the approach is robust, straightforward to implement, and outperforms existing solutions.

Predicting quality flaws in user-generated content: the case of Wikipedia

A quality flaw model is developed and a dedicated machine learning approach is employed to predict Wikipedia's most important quality flaws. The authors argue that common binary or multiclass classification approaches are ineffective for the prediction of quality flaws and hence cast quality flaw prediction as a one-class classification problem.

Automatic Expansion of DBpedia Exploiting Wikipedia Cross-Language Information

This work extends the population of the classes for the different languages by connecting the corresponding Wikipedia pages through cross-language links, and trains a supervised classifier using this extended set of classes as training data.