Corpus ID: 54878463

Adding Value to CMC Corpora: CLARINification and Part-of-speech Annotation of the Dortmund Chat Corpus

@inproceedings{Beiwenger2015AddingVT,
  title={Adding Value to CMC Corpora: CLARINification and Part-of-speech Annotation of the Dortmund Chat Corpus},
  author={Michael Bei{\ss}wenger and Eric Ehrhardt and Andrea Horbach and Harald L{\"u}ngen and Diana Steffen and Angelika Storrer},
  year={2015}
}
Michael Beißwenger, Eric Ehrhardt, Andrea Horbach, Harald Lüngen, Diana Steffen, Angelika Storrer
1 TU Dortmund University, Department of German Language and Literature, D–44221 Dortmund
2 Mannheim University, Department of German Philology, D–68131 Mannheim
3 Saarland University, Department of Computational Linguistics and Phonetics, D–66041 Saarbrücken
4 Institute for the German Language, Department of Central Research: Corpus Linguistics, D–68131 Mannheim


Integrating corpora of computer-mediated communication in CLARIN-D: Results from the curation project ChatCorpus2CLARIN
TLDR
The pipeline to integrate CMC and SM corpora into the CLARIN-D corpus infrastructure was developed by transforming an existing CMC corpus, the Dortmund Chat Corpus, into a resource conforming to current technical and legal standards.
CMC Corpora in DeReKo
We introduce three types of corpora of computer-mediated communication that have recently been compiled at the Institute for the German Language or curated from an external project and included in
DRuKoLA – towards contrastive German-Romanian research based on comparable corpora
This paper introduces the recently started DRuKoLA-project that aims at providing mechanisms to flexibly draw virtual comparable corpora from the German Reference Corpus DeReKo and the Reference
The Janes project: language resources and tools for Slovene user generated content
TLDR
The paper presents the results of the Janes project, which aimed to develop language resources and tools for Slovene user generated content; the tools include a tokeniser, word-normaliser, part-of-speech tagger and lemmatiser, and a named entity recogniser.
Challenges in the Management of Large Corpora and
Many (modernist) works of literature can be understood by their associativeness, be it constructed or “free”. This network-like character of (modernist) literature has often been addressed by terms
Proceedings of the workshop on challenges in the management of large corpora and big data and natural language processing (CMLC-5+BigNLP) 2017 including the papers from the web-as-corpus (WAC-XI) guest section. Birmingham, 24 July 2017
TLDR
This paper proposes an experimental and exemplary approach to intraconnect a literary corpus of the Austrian writer Ilse Aichinger with semantic web technologies to enable interactive explorations of word associations.
DELIVERABLE SUBMISSION SHEET
TLDR
The deliverable describes the results of Task 6.4 in WP6 on preliminary evaluations of the PHEME algorithms and their integration; the scalability of the integrated tools will be evaluated on the large-scale datasets collected in PHEME, as well as on historical data.
IBK- und Social Media-Korpora am Leibniz-Institut für Deutsche Sprache
The paper examines existing solutions and new possibilities for expanding the German Reference Corpus (DeReKo) with corpora of social media and internet-based communication (IBK). DeReKo is a

References

SHOWING 1-10 OF 17 REFERENCES
EXMARaLDA and the FOLK tools - two toolsets for transcribing and annotating spoken language
TLDR
The paper gives an overview of the individual tools of the two toolsets for transcribing and annotating spoken language: the EXMARaLDA system and the FOLK tools, developed at the Institute for the German Language in Mannheim.
The CoMeRe corpus for French: structuring and annotating heterogeneous CMC genres
TLDR
The Interaction Space model, a model for the automatic annotation of any freely variant elements within the CoMeRe corpora, is presented, and issues and decisions concerning the OpenData perspective are highlighted.
Improving the Performance of Standard Part-of-Speech Taggers for Computer-Mediated Communication
TLDR
It is found that extending a standard training set with small amounts of manually annotated data for Internet texts leads to a substantial improvement of tagger performance, which can be further improved by using a previously proposed method to automatically acquire training data.
Building Linguistic Corpora from Wikipedia Articles and Discussions
TLDR
This work built corpora from Wikipedia articles and talk pages in the I5 format, a TEI customisation used in the German Reference Corpus, using a two-stage conversion that combines parsing with the Sweble parser and transformation with XSLT stylesheets.
TIGER: Linguistic Interpretation of a German Corpus
TLDR
The TIGER Treebank, a corpus of currently 40,000 syntactically annotated German newspaper sentences, is described, along with the query language designed to facilitate a simple formulation of complex queries and a graphical user interface for query input.
Recent Developments in DeReKo
This paper gives an overview of recent developments in the German Reference Corpus DeReKo in terms of growth, maximising relevant corpus strata, metadata, legal issues, and its current and future
Fast Domain Adaptation for Part of Speech Tagging for Dialogues
TLDR
This work investigates a fast method for domain adaptation, which provides additional in-domain training data from an unannotated data set by applying POS taggers with different biases to the unannotated data set and then choosing the set of sentences on which the taggers agree.
STTS 2.0? Improving the Tagset for the Part-of-Speech-Tagging of German Spoken Data
TLDR
A recategorization of the STTS in the field of speech particles based on distributional factors rather than semantics is proposed to create a comprehensive reference corpus of spoken German data for the global research community.
Internet Corpora: A Challenge for Linguistic Processing
TLDR
A range of easy-to-implement methods of adapting existing part-of-speech taggers to improve their performance on Internet texts are explored and it is shown that these methods can improve tagger performance substantially.
A TEI Schema for the Representation of Computer-mediated Communication
The paper presents an XML schema for the representation of genres of computer-mediated communication (CMC) that is compliant with the encoding framework defined by the TEI. It was designed for the