Corpus ID: 14287779

The Gold Standard in Corpus Annotation

Lars Wißler, Mohammed Almashraee, Dagmar Monett, Adrian Paschke. IEEE GSC.
Trustworthy corpora are necessary for training and meaningful evaluation of algorithms which use annotations. These standard collections are called Gold Standard Corpora (GSC). However, the construction of GSC is a laborious and time-consuming process, and the size, quality and, most of all, availability of task-specific GSC directly influence the development of machine learning based natural language processing algorithms. This paper provides an introduction to gold standard corpus construction in the…
Gold-standard ontology-based anatomical annotation in the CRAFT Corpus
This newly created set of annotations is by far the largest publicly available collection of gold-standard anatomical markup and is the first large-scale effort at manual markup of biomedical text relying on the entirety of an anatomical terminology, as opposed to annotation with a small number of high-level anatomical categories, as performed in previous corpora.
Machine Learning-based approach to automatic POS tagging of Macedonian language
This paper presents the research that has contributed to the creation of an automatic part-of-speech (POS) tagger of Macedonian, a Slavic language that has a rich morphology, but limited language…
Design and Implementation of German Legal Decision Corpora
Law professionals are wordsmiths, their main tool is language. Therefore, the field of law produces a vast amount of written text. These texts have to be analysed, summarised, and used in the…
A Crowd Science framework to support the construction of a Gold Standard Corpus for Plagiarism Detection
A framework to support the construction of a Gold Standard Corpus for Plagiarism Detection in any language and a Crowd Science project that employs human processing power to identify plagiarism in pairs of textual data extracted via the data acquisition process is presented.
Statistic Supported Cooperative Creation of Training Corpora for the Extraction of Traffic Information from Microblogs
This paper presents an approach for the cooperative creation of annotated corpora for training and validation of information extraction systems supported by statistical analyses.
Guidelines for building a Gold Standard Corpus of argumentative discourse
This paper explains Adpositional Argumentation (AdArg), a new method for annotating arguments expressed in natural language. In describing this method, it provides the guidelines for designing a Gold…
Towards Classifying Parts of German Legal Writing Styles in German Legal Judgments
The main tool of a lawyer is their language. Legal prose is bound by writing styles, especially in Germany. These styles ensure that, i.a. judgments are written in a structured and comprehensive way.
Analysis of Sentiment Direction Based on Two Centuries of the Hansard Debate Archive
This project was conceived by Nalanda Technology, a text data search and analysis company. This study concerned the following central research question: Can Machine Learning techniques be used to…
Visual Interactive Comparison of Part-of-Speech Models for Domain Adaptation
An interactive visualization approach is presented that facilitates analysts in determining part-of-speech tagging errors by comparing several standard part-of-speech tagger results graphically and allows users to explore, compare, evaluate, and adapt the results through interactive feedback in order to obtain a new model, which can then be applied to similar types of texts.
Automated code compliance checking in the construction domain using semantic natural language processing and logic-based reasoning
A new ACC method that utilizes semantic natural language processing (NLP) techniques to automatically extract regulatory information from building codes and design information from building information models (BIMs) and utilizes a semantic logic-based representation to represent and reason about the extracted regulatory information and design information for compliance checking is proposed.


Analysing Wikipedia and Gold-Standard Corpora for NER Training
A Wikipedia corpus is developed which outperforms gold standard corpora on cross-corpus evaluation by up to 11% and identifies the causes of poor cross-corpus performance and demonstrates ways of making them more compatible.
An Approach to Text Corpus Construction which Cuts Annotation Costs and Maintains Reusability of Annotated Data
The issue of whether a corpus annotated by means of AL can be re-used to train classifiers different from the ones employed by AL, supplying alternative feature sets as well, is addressed.
Gold standard datasets for evaluating word sense disambiguation programs
The background, challenges and strategies are discussed, and a detailed methodology for ensuring that the gold standard is not fool's gold is presented.
A method for determining the number of documents needed for a gold standard corpus
  • D. Juckett
  • Computer Science, Medicine
  • J. Biomed. Informatics
  • 2012
A method is outlined to determine gold standard size based on the capture probabilities for the unique words within a target corpus, and it is shown that a representative sample, of justifiable size, can be selected for use as a gold standard.
Assessing the practical usability of an automatically annotated corpus
It is shown that it is possible to automatically improve the quality and the quantity of the SSC annotations and that considering only those sentences of SSC which contain annotations rather than the full SSC results in a performance boost.
Cheap and Fast – But is it Good? Evaluating Non-Expert Annotations for Natural Language Tasks
This work explores the use of Amazon's Mechanical Turk system, a significantly cheaper and faster method for collecting annotations from a broad base of paid non-expert contributors over the Web, and proposes a technique for bias correction that significantly improves annotation quality on two tasks.
The CALBC Silver Standard Corpus for Biomedical Named Entities - A Study in Harmonizing the Contributions from Four Independent Named Entity Taggers
An analysis of the most frequent annotations from all systems shows that a high agreement amongst systems leads to the selection of terms that are suitable to be kept in the harmonised set, the first large-scale approach to generate an annotated corpus from automated annotation systems.
KAFnotator : a multilingual semantic text annotation tool
At present, the availability of high quality annotated corpora is fundamental to carry out or to evaluate several Natural Language Processing and Text Mining tasks. To create consistently annotated…
A Case Study on Inter-Annotator Agreement for Word Sense Disambiguation
This paper examines this issue by comparing the agreement rate on a large corpus of more than 30,000 sense-tagged instances of the WORDNET Semcor corpus and the DSO corpus, which has been independently tagged by two separate groups of human annotators.
Semantic Enrichment by Non-experts: Usability of Manual Annotation Tools
A tool for semantic annotation of digital documents is developed and an end-user study is conducted to evaluate its acceptance by and usability for non-expert users and the lessons learned are discussed.