DKPro TC: A Java-based Framework for Supervised Learning Experiments on Textual Data

@inproceedings{Daxenberger2014DKProTA,
  title={DKPro TC: A Java-based Framework for Supervised Learning Experiments on Textual Data},
  author={Johannes Daxenberger and Oliver Ferschke and Iryna Gurevych and Torsten Zesch},
  booktitle={ACL},
  year={2014}
}
We present DKPro TC, a framework for supervised learning experiments on textual data. [...] It ships with standard feature extraction modules, while at the same time allowing the user to add customized extractors. The extensive reporting and logging facilities make DKPro TC experiments fully replicable.
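The idea of shipping standard feature extractors while letting users plug in their own can be illustrated with a minimal Java sketch. This is not DKPro TC's actual API: the interface `FeatureExtractor`, both extractor classes, and the helper `extractAll` are hypothetical names chosen for illustration, and plain strings stand in for whatever document representation the framework really uses.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class FeatureExtractionSketch {

    /** Hypothetical contract: map one document to named numeric features. */
    interface FeatureExtractor {
        Map<String, Double> extract(String document);
    }

    /** Stands in for a "standard" module: counts whitespace-separated tokens. */
    static class TokenCountExtractor implements FeatureExtractor {
        @Override
        public Map<String, Double> extract(String document) {
            String trimmed = document.trim();
            double count = trimmed.isEmpty() ? 0.0 : trimmed.split("\\s+").length;
            Map<String, Double> features = new LinkedHashMap<>();
            features.put("tokenCount", count);
            return features;
        }
    }

    /** Stands in for a user-supplied extractor: share of '!' characters. */
    static class ExclamationRatioExtractor implements FeatureExtractor {
        @Override
        public Map<String, Double> extract(String document) {
            long marks = document.chars().filter(c -> c == '!').count();
            double ratio = document.isEmpty() ? 0.0 : (double) marks / document.length();
            Map<String, Double> features = new LinkedHashMap<>();
            features.put("exclamationRatio", ratio);
            return features;
        }
    }

    /** Runs every registered extractor and merges the results into one feature map. */
    static Map<String, Double> extractAll(List<FeatureExtractor> extractors, String document) {
        Map<String, Double> all = new LinkedHashMap<>();
        for (FeatureExtractor extractor : extractors) {
            all.putAll(extractor.extract(document));
        }
        return all;
    }

    public static void main(String[] args) {
        List<FeatureExtractor> extractors = new ArrayList<>();
        extractors.add(new TokenCountExtractor());        // built-in style
        extractors.add(new ExclamationRatioExtractor());  // user-defined style
        System.out.println(extractAll(extractors, "Great paper! Truly great!"));
    }
}
```

Because the pipeline in this sketch depends only on the interface, a custom extractor is registered the same way as a built-in one, which is the kind of extensibility the abstract describes.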

DeepTC - An Extension of DKPro Text Classification for Fostering Reproducibility of Deep Learning Experiments
TLDR
A deep learning extension is presented for the multi-purpose text classification framework DKPro Text Classification, which did not previously allow integration of deep learning. The extension provides convenience features that take care of repetitive steps such as pre-processing, data vectorization, and pruning of embeddings.
Exploiting Debate Portals for Semi-Supervised Argumentation Mining in User-Generated Web Discourse
TLDR
Novel features that exploit clustering of unlabeled data from debate portals based on a word embeddings representation are proposed that significantly outperform several baselines in the cross-validation, cross-domain, and cross-register evaluation scenarios.
ESCRITO - An NLP-Enhanced Educational Scoring Toolkit
We propose ESCRITO, a toolkit for scoring student writings using NLP techniques that addresses two main user groups: teachers and NLP researchers. Teachers can use a high-level API in the teacher
Mass Collaboration on the Web: Textual Content Analysis by Means of Natural Language Processing
TLDR
This chapter describes perspectives for utilizing natural language processing (NLP) to analyze artifacts arising from mass collaboration on the web and introduces recent advances and ongoing efforts to analyze textual content in two web-based resources of mass collaboration with the help of NLP.
Knowledge Discovery in Scientific Literature
TLDR
Novel methods and techniques are developed in the Knowledge Discovery in Scientific Literature (KDSL) research program, and all of them are applied to the same set of freely available scientific articles.
Leveraging Lexical-Semantic Knowledge for Text Classification Tasks
TLDR
This thesis proposes to address the synonymy problem by automatically enriching the training and testing data with conceptual annotations accessible through lexical-semantic resources, and shows that such conceptual information, in combination with the previous word sense disambiguation step, helps to build more robust classifiers and improves classification performance of multiple tasks.
Proceedings of the 10th Web as Corpus Workshop, WAC@ACL 2016, Berlin, August 12, 2016
TLDR
Preliminary results from an ongoing experiment wherein two large unstructured text corpora are classified by topic domain (or subject area) are described, indicating that a revised classification scheme and larger gold standard corpora will likely lead to a substantial increase in accuracy.
Automatically Detecting Corresponding Edit-Turn-Pairs in Wikipedia
TLDR
This study analyzes links between edits in Wikipedia articles and turns from their discussion page to better understand implicit details about the writing process and knowledge flow in collaboratively created resources.
Towards a Gold Standard Corpus for Variable Detection and Linking in Social Science Publications
TLDR
The effort to create a new corpus for the evaluation of detecting and linking so-called survey variables in social science publications is described and the annotated corpus is made available along with an open-source baseline system for variable mention identification and linking.
FlexTag: A Highly Flexible PoS Tagging Framework
TLDR
FlexTag makes it easy to quickly develop custom-made taggers exactly fitting the research problem, in contrast to monolithic implementations that can only be retrained but not adapted otherwise.

References

SHOWING 1-10 OF 13 REFERENCES
A lightweight framework for reproducible parameter sweeping in information retrieval
TLDR
The framework for dataflow-based parameter sweeping experiments introduced in this paper is lightweight, provides support for declaratively setting up experiments, and integrates seamlessly with Java-based development environments.
Development and Analysis of NLP Pipelines in Argo
TLDR
Argo, a Web-based workbench for the development and processing of NLP pipelines/workflows based upon UIMA, is demonstrated, which allows users to seamlessly connect their tools to workflows running in Argo, and take advantage of both the available library of components and the analytical tools.
UIMA: an architectural approach to unstructured information processing in the corporate research environment
TLDR
A general introduction to UIMA is given, focusing on the design points of its analysis engine architecture, and how UIMA is helping to accelerate research and technology transfer is discussed.
Multi-instance Multi-label Learning for Relation Extraction
TLDR
This work proposes a novel approach to multi-instance multi-label learning for RE, which jointly models all the instances of a pair of entities in text and all their labels using a graphical model with latent variables that performs competitively on two difficult domains.
Scikit-learn: Machine Learning in Python
Scikit-learn is a Python module integrating a wide range of state-of-the-art machine learning algorithms for medium-scale supervised and unsupervised problems. This package focuses on bringing
Natural Language Processing with Python
This book offers a highly accessible introduction to natural language processing, the field that supports a variety of language technologies, from predictive text and email filtering to automatic
The WEKA data mining software: an update
TLDR
This paper provides an introduction to the WEKA workbench, reviews the history of the project, and, in light of the recent 3.6 stable release, briefly discusses what has been added since the last stable version (Weka 3.4) released in 2003.
Part-of-Speech Tagging for Twitter: Annotation, Features, and Experiments
TLDR
A tagset is developed, data is annotated, features are developed, and results nearing 90% accuracy are reported on the problem of part-of-speech tagging for English data from the popular micro-blogging service Twitter.
Mining Multi-label Data
A large body of research in supervised learning deals with the analysis of single-label data, where training examples are associated with a single label λ from a set of disjoint labels L. However,
Offspring from Reproduction Problems: What Replication Failure Teaches Us
TLDR
This work presents two concrete use cases involving key techniques in the NLP domain for which it is shown that reproducing results is still difficult and that more care should be taken in interpreting results.