Attention paid to spoken language has increased in recent decades, as has its importance for linguistic research and natural language processing in general. However, compiling spoken corpora, an indispensable source of data, is very laborious and thus expensive. Nevertheless, more and more spoken corpora are currently being created. There are(More)
The paper overviews the SYN series of synchronic corpora of written Czech compiled within the framework of the Czech National Corpus project. It describes their design and processing with a focus on the annotation, i.e. lemmatization and morphological tagging. The paper also introduces SYN2013PUB, a new 935-million newspaper corpus of Czech published in(More)
The last two decades have seen the development of various semantic lexical resources such as WordNet (Miller, 1995) and the USAS semantic lexicon (Rayson et al., 2004), which have played an important role in the areas of natural language processing and corpus-based studies. Recently, increasing efforts have been devoted to extending the semantic frameworks(More)
The paper presents a data repository that will be used as a source of data for ORAL2013, a new corpus of spontaneous spoken Czech. The corpus is planned to be published in 2013 within the framework of the Czech National Corpus, and it will contain both the audio recordings and their transcriptions manually aligned with time stamps. The corpus will be designed(More)
The paper presents ORAL2008, a new 1-million corpus of spoken Czech compiled within the framework of the Czech National Corpus project. ORAL2008 is designed as a representation of authentic spoken language used in informal situations, and it is balanced in the main sociolinguistic categories of speakers. The paper also concentrates on the data collection,(More)
IPTV has been widely deployed throughout the world, bringing significant advantages to users in terms of the channel offering, video on demand, and interactive applications. TV set-top boxes that are deployed in modern IPTV systems can be thought of as capable sensor nodes that collect vast amounts of data, representing both the user activity and the(More)
The paper concentrates on the design, composition and annotation of SYN2015, a new 100-million representative corpus of contemporary written Czech. SYN2015 is a sequel to the representative corpora of the SYN series that can be described as traditional (as opposed to the web-crawled corpora), featuring cleared copyright issues, well-defined composition,(More)