DeRiK: A German reference corpus of computer-mediated communication

Michael Beißwenger, Maria Ermakova, Alexander Geyken, Lothar Lemnitzer, Angelika Storrer
Lit. Linguistic Comput.
The paper describes an ongoing project that aims to build a reference corpus of German computer-mediated communication (CMC) as a new component of an existing reference corpus of contemporary written German. The ‘Deutsches Referenzkorpus zur internetbasierten Kommunikation’ (DeRiK) will include data from the most prominent CMC genres among German Internet users and thus close a gap in the coverage of the corpus resources in the project “Digitales Wörterbuch der deutschen Sprache… 

A TEI Schema for the Representation of Computer-mediated Communication

The paper presents an XML schema for the representation of genres of computer-mediated communication (CMC) that is compliant with the encoding framework defined by the TEI. It was designed for the

TEI across corpora, languages and genres: Towards a standard for the representation of social media and computer-mediated communication

The panel presents results and ongoing work from corpus projects in which TEI-P5 has been adopted for the representation and linguistic annotation of genres of social media and computer-mediated communication (CMC), using German and French CMC corpora as examples.

Challenges of building a CMC corpus for analyzing writer's style by age: The DiDi project

The DiDi project collects German computer-mediated communication data written by internet users from the Italian province of Bolzano – South Tyrol and analyses how L1 German speakers in South Tyrol use different varieties of German and other languages to communicate on social network sites.

Computer-mediated communication in TEI: What lies ahead

This panel will discuss how the models provided by the TEI encoding framework may be adapted to the special requirements of cmc genres and what might be a practical and reasonable way to go about creating such a standard.

Integrating corpora of computer-mediated communication into the language resources landscape: Initiatives and best practices from French, German, Italian and Slovenian projects

The paper presents best practices and results from projects in four countries dedicated to the creation of corpora of computer-mediated communication and social media interactions (CMC). Even though

Compilation and Annotation of the Discourse-structured Blog Corpus for German

The first results of the compilation and annotation of a blog corpus for German are reported, which are of interest for the general study of blog structure or related research questions as well as for the development of NLP methods and techniques.

Ad hoc and general-purpose corpus construction from web sources. (Construction de corpus généraux et spécialisés à partir du Web)

This work explains why the importance of preprocessing should not be underestimated and why linguists need to learn new skills in order to handle the whole data gathering and preprocessing phase.

Types and annotation of reply relations in computer-mediated communication

An annotation proposal is provided that combines the different levels of description and representation of reply relations and adheres to the schemas and practices for encoding CMC corpus documents within the TEI framework as defined by the TEI CMC SIG.

Closing a gap in the language resources landscape: Groundwork and best practices from projects on computer-mediated communication in four European countries

A range of accessible solutions already exists which have been tested in projects and which may serve as a starting point for a more precise discussion of how future standards for CMC corpora may (and should) be shaped.

Paper 2: Expanding the TEI encoding framework to genres of computer-mediated communication: considerations and suggestions

The social web has brought forth various genres of interpersonal communication (computer-mediated communication, henceforth: cmc) such as chats, discussion forums, wiki talk pages, Twitter, comment



A TEI Schema for the Representation of Computer-mediated Communication

The paper presents an XML schema for the representation of genres of computer-mediated communication (CMC) that is compliant with the encoding framework defined by the TEI. It was designed for the

The DWDS corpus: A reference corpus for the German language of the 20th century

The DWDS corpus, constructed at the Berlin-Brandenburg Academy of Sciences (BBAW) between 2000 and 2003, consists altogether of over a billion words of running text. Corpus building continues to be

Computer-mediated communication : linguistic, social and cross-cultural perspectives

1. Foreword 2. Introduction 3. I. Linguistic Perspectives 4. Electronic Language: A new variety of English (by Collot, Milena) 5. Oral and written linguistic aspects of computer conferencing (by

Lexical and Discourse Analysis of Online Chat Dialog

The purpose of this research is to build a chat corpus, tagged with lexical (token part-of-speech labels), syntactic (post parse tree), and discourse (post classification) information that can be used to develop more complex, statistical-based NLP applications that perform tasks such as author profiling, entity identification, and social network analysis.

Balancing SoNaR: IPR versus Processing Issues in a 500-Million-Word Written Dutch Reference Corpus

The interplay between data acquisition and data processing during the creation of the SoNaR Corpus is discussed, which is designed to contain 500 million words, balanced over 36 text types including both traditional and new media texts.

Internet Linguistics: A Student Guide

In this student-friendly guidebook, leading language authority Professor David Crystal follows on from his landmark bestseller Language and the Internet and presents the area as a new field: Internet linguistics.

Language and the Internet

Covering a range of Internet genres, including e-mail, chat, and the Web, this is a revealing account of how the Internet is radically changing the way we use language.

A Hybrid Approach to Part-of-Speech Tagging

The dwdst PoS tagging library is described, which makes use of a rule-based morphological component to extend traditional HMM techniques by the inclusion of lexical class probabilities and theoretically motivated search space reduction.

TAGH: A Complete Morphology for German Based on Weighted Finite State Automata

TAGH is a system for automatic recognition of German word forms based on a stem lexicon with allomorphs and a concatenative mechanism for inflection and word formation that was compiled within 5 years on the basis of large newspaper corpora and literary texts.

Das Digitale Wörterbuch der Deutschen Sprache (DWDS)

It reached completion, and yet it did not, for when the final instalment of the Deutsches Wörterbuch appeared in 1960, it had long been clear that large parts of this monumental work