Corpus ID: 2300418

DKPro-UGD: A Flexible Data-Cleansing Approach to Processing User-Generated Discourse

  • R. Eckart de Castilho, Iryna Gurevych
  • Published 2009
  • Computer Science
  • User-generated discourse from Web 2.0 poses particular challenges to natural language processing (NLP) because it is noisy and error-prone. A data cleansing step preceding the analysis steps in an NLP pipeline can reduce these problems. While recent efforts provide general-purpose collections of UIMA-based analysis components, data cleansing does not yet seem to be covered. The five-stage data cleansing approach proposed here offers maximum flexibility in identifying problematic artifacts…
    Approaches to Automatic Text Structuring
    Bluima: a UIMA-based NLP Toolkit for Neuroscience
    Natural Language Processing: Integration of Automatic and Manual Analysis
    Collaborative Web-Based Tools for Multi-layer Text Annotation

