Corpus ID: 2300418

DKPro-UGD: A Flexible Data-Cleansing Approach to Processing User-Generated Discourse

Richard Eckart de Castilho and Iryna Gurevych
User-generated discourse from Web 2.0 poses particular challenges to natural language processing (NLP) because it is noisy and error-prone. A data cleansing step preceding the analysis steps in an NLP pipeline can reduce these problems. While recent efforts provide general-purpose collections of UIMA-based analysis components, data cleansing does not yet seem to be covered. The five-stage data cleansing approach proposed here offers a maximum of flexibility in identifying problematic artifacts…
Approaches to Automatic Text Structuring
Two prototypes of text structuring systems are presented, which integrate techniques for automatic text structuring in a wiki setting and in an e-learning setting with eBooks, and the effect of senses on computing similarities is analyzed.
Bluima: a UIMA-based NLP Toolkit for Neuroscience
This paper describes Bluima, a natural language processing (NLP) pipeline based on the UIMA framework that focuses on the extraction of neuroscientific content. It adds further models and tools specific to neuroscience and provides collection readers for neuroscientific corpora.
Natural Language Processing: Integration of Automatic and Manual Analysis
This work develops innovative technical solutions and designs to facilitate the use of automatic analysis and to promote the integration of manual and automatic analysis, and demonstrates the adequacy of the concepts through examples which represent whole classes of research problems.
Collaborative Web-Based Tools for Multi-layer Text Annotation
In this chapter, requirements for web-based annotation tools are outlined in detail, a variety of tools are reviewed with respect to these requirements, and further directions are pointed out, such as increased schema flexibility and tighter integration of automation for annotation suggestions.
WebAnno: A Flexible, Web-based and Visually Supported System for Distributed Annotations
WebAnno offers annotation project management, freely configurable tagsets, and the management of users in different roles. Its architecture allows further modes of visualization and editing to be added when new kinds of annotations are to be supported.
Study of semantic relatedness of words using collaboratively constructed semantic resources
From comprehensive intrinsic and extrinsic evaluations, it is concluded that collaboratively constructed semantic resources provide better coverage than linguistically constructed semantic resources while yielding comparable task performance, and can indeed be used as a proxy for linguistically constructed semantic resources that might not exist for minor languages.
DKPro Keyphrases: Flexible and Reusable Keyphrase Extraction Experiments
DKPro Keyphrases is a keyphrase extraction framework based on UIMA. It offers a wide range of state-of-the-art keyphrase experiments approaches. At the same time, it is a workbench for developing new…
Analyzing Formulaic Patterns in Historical Corpora
This paper aims to point out a linguistic phenomenon that, given the current stage of research, can be analysed only insufficiently with the help of an electronic text corpus. In this way, the paper…


UIMA: an architectural approach to unstructured information processing in the corporate research environment
A general introduction to UIMA is given, focusing on the design points of its analysis engine architecture, and how UIMA is helping to accelerate research and technology transfer is discussed.
An Annotation Type System for a Data-Driven NLP Pipeline
An annotation type system for a data-driven NLP core system that covers formal document structure and document meta information, as well as the linguistic levels of morphology, syntax and semantics is introduced.
Flexible UIMA Components for Information Retrieval Research
A suite of flexible UIMA-based components for information retrieval research is presented; the components have been successfully used (and re-used) in several projects in different application domains.