DeRiK: A German reference corpus of computer-mediated communication

@inproceedings{Beiwenger2012DeRiKAG,
  title={DeRiK: A German reference corpus of computer-mediated communication},
  author={Michael Bei{\ss}wenger and Maria Ermakova and Alexander Geyken and Lothar Lemnitzer and Angelika Storrer},
  booktitle={Lit. Linguistic Comput.},
  year={2012}
}
The paper describes an ongoing project that aims at building a reference corpus of German computer-mediated communication (CMC) as a new component of an already existing reference corpus of written contemporary German. The ‘Deutsches Referenzkorpus zur internetbasierten Kommunikation’ (DeRiK) shall include data from the most prominent CMC genres amongst German Internet users and, thus, close a gap in the coverage of the corpus resources in the project “Digitales Wörterbuch der deutschen Sprache…
A TEI Schema for the Representation of Computer-mediated Communication
The paper presents an XML schema for the representation of genres of computer-mediated communication (CMC) that is compliant with the encoding framework defined by the TEI. It was designed for the…
TEI across corpora, languages and genres: Towards a standard for the representation of social media and computer-mediated communication
TLDR
The panel presents results and ongoing work from corpus projects in which TEI-P5 has been adopted for the representation and linguistic annotation of genres of social media and computer-mediated communication (CMC) on the example of German and French CMC corpora.
Challenges of building a CMC corpus for analyzing writer's style by age: The DiDi project
TLDR
The project DiDi collects and analyzes German data of computer-mediated communication written by internet users from the Italian province of Bolzano – South Tyrol, and analyzes how L1 German speakers in South Tyrol use different varieties of German and other languages to communicate on social network sites.
Computer-mediated communication in TEI: What lies ahead
TLDR
This panel will discuss how the models provided by the TEI encoding framework may be adapted to the special requirements of cmc genres and what might be a practical and reasonable way to go about creating such a standard.
Integrating corpora of computer-mediated communication into the language resources landscape: Initiatives and best practices from French, German, Italian and Slovenian projects
The paper presents best practices and results from projects in four countries dedicated to the creation of corpora of computer-mediated communication and social media interactions (CMC). Even though…
Compilation and Annotation of the Discourse-structured Blog Corpus for German
TLDR
The first results of the compilation and annotation of a blog corpus for German are reported, which are of interest for the general study of blog structure or related research questions as well as for the development of NLP methods and techniques.
Ad hoc and general-purpose corpus construction from web sources. (Construction de corpus généraux et spécialisés à partir du Web)
TLDR
The paper explains why the importance of preprocessing should not be underestimated and why it is an important task for linguists to learn new skills in order to confront the whole data gathering and preprocessing phase.
Types and annotation of reply relations in computer-mediated communication
TLDR
An annotation proposal is provided that combines the different levels of description and representation of reply relations and which adheres to the schemas and practices for encoding CMC corpus documents within the TEI framework as defined by the TEI CMC SIG.
Closing a gap in the language resources landscape: Groundwork and best practices from projects on computer-mediated communication in four European countries
TLDR
There already exists a range of accessible solutions which have been tested in projects and which may serve as a starting point for a more precise discussion of how future standards for CMC corpora may (and should) be shaped.
Paper 2: Expanding the TEI encoding framework to genres of computer-mediated communication: considerations and suggestions
The social web has brought forth various genres of interpersonal communication (computer-mediated communication, henceforth: cmc) such as chats, discussion forums, wiki talk pages, Twitter, comment…

References

SHOWING 1-10 OF 25 REFERENCES
A TEI Schema for the Representation of Computer-mediated Communication
The paper presents an XML schema for the representation of genres of computer-mediated communication (CMC) that is compliant with the encoding framework defined by the TEI. It was designed for the…
The DWDS corpus: A reference corpus for the German language of the 20th century
The DWDS corpus, constructed at the Berlin-Brandenburg Academy of Sciences (BBAW) between 2000 and 2003, consists altogether of over a billion words of running text. Corpus building continues to be…
The Netlog Corpus. A Resource for the Study of Flemish Dutch Internet Language
TLDR
This paper presents a large corpus of Flemish Dutch chat posts that were collected from the Belgian online social network Netlog and proposes to normalize this ‘anomalous’ input into a format suitable for existing NLP solutions for standard Dutch.
Computer-mediated communication: linguistic, social and cross-cultural perspectives
1. Foreword 2. Introduction 3. I. Linguistic Perspectives 4. Electronic Language: A new variety of English (by Collot, Milena) 5. Oral and written linguistic aspects of computer conferencing (by…
Lexical and Discourse Analysis of Online Chat Dialog
TLDR
The purpose of this research is to build a chat corpus, tagged with lexical (token part-of-speech labels), syntactic (post parse tree), and discourse (post classification) information that can be used to develop more complex, statistical-based NLP applications that perform tasks such as author profiling, entity identification, and social network analysis.
Balancing SoNaR: IPR versus Processing Issues in a 500-Million-Word Written Dutch Reference Corpus
TLDR
The interplay between data acquisition and data processing during the creation of the SoNaR Corpus is discussed, which is designed to contain 500 million words, balanced over 36 text types including both traditional and new media texts.
Internet Linguistics: A Student Guide
TLDR
In this student-friendly guidebook, leading language authority Professor David Crystal follows on from his landmark bestseller Language and the Internet and presents the area as a new field: Internet linguistics.
Language and the Internet
TLDR
Covering a range of Internet genres, including e-mail, chat, and the Web, this is a revealing account of how the Internet is radically changing the way we use language.
A Hybrid Approach to Part-of-Speech Tagging
TLDR
The dwdst PoS tagging library is described, which makes use of a rule-based morphological component to extend traditional HMM techniques by the inclusion of lexical class probabilities and theoretically motivated search space reduction.
TAGH: A Complete Morphology for German Based on Weighted Finite State Automata
TLDR
TAGH is a system for automatic recognition of German word forms based on a stem lexicon with allomorphs and a concatenative mechanism for inflection and word formation that was compiled within 5 years on the basis of large newspaper corpora and literary texts.