Adding Value to CMC Corpora: CLARINification and Part-of-speech Annotation of the Dortmund Chat Corpus
@inproceedings{Beiwenger2015AddingVT, title={Adding Value to CMC Corpora: CLARINification and Part-of-speech Annotation of the Dortmund Chat Corpus}, author={Michael Bei{\ss}wenger and Eric Ehrhardt and Andrea Horbach and Harald L{\"u}ngen and Diana Steffen and Angelika Storrer}, year={2015} }
Michael Beißwenger, Eric Ehrhardt, Andrea Horbach, Harald Lüngen, Diana Steffen, Angelika Storrer
1 TU Dortmund University, Department of German Language and Literature, D–44221 Dortmund
2 Mannheim University, Department of German Philology, D–68131 Mannheim
3 Saarland University, Department of Computational Linguistics and Phonetics, D–66041 Saarbrücken
4 Institute for the German Language, Department of Central Research: Corpus Linguistics, D–68131 Mannheim
9 Citations
Integrating corpora of computer-mediated communication in CLARIN-D: Results from the curation project ChatCorpus2CLARIN
- Computer Science, KONVENS
- 2016
The pipeline to integrate CMC and SM corpora into the CLARIN-D corpus infrastructure was developed by transforming an existing CMC corpus, the Dortmund Chat Corpus, into a resource conforming to current technical and legal standards.
CMC Corpora in DeReKo
- Computer Science
- 2017
We introduce three types of corpora of computer-mediated communication that have recently been compiled at the Institute for the German Language or curated from an external project and included in…
DRuKoLA – towards contrastive German-Romanian research based on comparable corpora
- Linguistics
- 2016
This paper introduces the recently started DRuKoLA-project that aims at providing mechanisms to flexibly draw virtual comparable corpora from the German Reference Corpus DeReKo and the Reference…
The Janes project: language resources and tools for Slovene user generated content
- Computer Science, Lang. Resour. Evaluation
- 2020
The paper presents the results of the Janes project, which aimed to develop language resources and tools for Slovene user generated content; the tools include a tokeniser, word-normaliser, part-of-speech tagger and lemmatiser, and a named entity recogniser.
Challenges in the Management of Large Corpora and
- Art
- 2018
Many (modernist) works of literature can be understood by their associativeness, be it constructed or “free”. This network-like character of (modernist) literature has often been addressed by terms…
Proceedings of the workshop on challenges in the management of large corpora and big data and natural language processing (CMLC-5+BigNLP) 2017 including the papers from the web-as-corpus (WAC-XI) guest section. Birmingham, 24 July 2017
- Computer Science
- 2017
This paper proposes an experimental and exemplary approach to intraconnect a literary corpus of the Austrian writer Ilse Aichinger with semantic web technologies to enable interactive explorations of word associations.
DELIVERABLE SUBMISSION SHEET
- Computer Science
- 2016
The deliverable describes the results of Task 6.4 in WP6 on preliminary evaluations of the PHEME algorithms and their integration. The scalability of the integrated tools will be evaluated on the large-scale datasets collected in PHEME, as well as on historical data.
IBK- und Social Media-Korpora am Leibniz-Institut für Deutsche Sprache
- Sociology, Deutsch in Sozialen Medien
- 2020
The article examines existing solutions and new options for expanding the German Reference Corpus (DeReKo) with corpora of social media and internet-based communication (IBK). DeReKo is a…
References
EXMARaLDA and the FOLK tools - two toolsets for transcribing and annotating spoken language
- Computer Science, LREC
- 2012
The paper gives an overview of the individual tools of the two toolsets for transcribing and annotating spoken language: the EXMARaLDA system and the FOLK tools, developed at the Institute for the German Language in Mannheim.
The CoMeRe corpus for French: structuring and annotating heterogeneous CMC genres
- Computer Science, J. Lang. Technol. Comput. Linguistics
- 2014
The Interaction Space model, a model for the automatic annotation of any freely variant elements within the CoMeRe corpora, is presented, and issues and decisions concerning the OpenData perspective are highlighted.
Improving the Performance of Standard Part-of-Speech Taggers for Computer-Mediated Communication
- Computer Science, KONVENS
- 2014
It is found that extending a standard training set with small amounts of manually annotated data for Internet texts leads to a substantial improvement of tagger performance, which can be further improved by using a previously proposed method to automatically acquire training data.
Building Linguistic Corpora from Wikipedia Articles and Discussions
- Computer Science, J. Lang. Technol. Comput. Linguistics
- 2014
This work built corpora from Wikipedia articles and talk pages in the I5 format, a TEI customisation used in the German Reference Corpus, using a two-stage conversion that combines parsing with the Sweble parser and transformation with XSLT stylesheets.
TIGER: Linguistic Interpretation of a German Corpus
- Computer Science
- 2004
The TIGER Treebank, a corpus of currently 40,000 syntactically annotated German newspaper sentences, is described, along with the query language designed to facilitate a simple formulation of complex queries and a graphical user interface for query input.
Recent Developments in DeReKo
- Computer Science, LREC
- 2014
This paper gives an overview of recent developments in the German Reference Corpus DeReKo in terms of growth, maximising relevant corpus strata, metadata, legal issues, and its current and future…
Fast Domain Adaptation for Part of Speech Tagging for Dialogues
- Computer Science, RANLP
- 2011
This work investigates a fast method for domain adaptation, which provides additional in-domain training data from an unannotated data set by applying POS taggers with different biases to the unannotated data set and then choosing the set of sentences on which the taggers agree.
STTS 2.0? Improving the Tagset for the Part-of-Speech-Tagging of German Spoken Data
- Linguistics, LAW@COLING
- 2014
A recategorization of the STTS in the field of speech particles based on distributional factors rather than semantics is proposed to create a comprehensive reference corpus of spoken German data for the global research community.
Internet Corpora: A Challenge for Linguistic Processing
- Computer Science, Datenbank-Spektrum
- 2014
A range of easy-to-implement methods of adapting existing part-of-speech taggers to improve their performance on Internet texts are explored and it is shown that these methods can improve tagger performance substantially.
A TEI Schema for the Representation of Computer-mediated Communication
- Computer Science
- 2012
The paper presents an XML schema for the representation of genres of computer-mediated communication (CMC) that is compliant with the encoding framework defined by the TEI. It was designed for the…