Johannes Graën

We discovered several recurring errors in the current version of the Europarl Corpus, originating both from the web site of the European Parliament and from the corpus compilation based on it. The most frequent error was incompletely extracted metadata, which left non-textual fragments within the textual parts of the corpus files. This is, on average, the case for …
It has been understood for a long time that the semantic content of a combination of two or more words often cannot be derived from the semantics of the single words, but that the use of one particular word imposes restrictions upon others (Firth 1957; Evert 2004, 15–17). The semantics is then either determined by the ruling word, e.g., in the case of light …
The availability of large multi-parallel corpora offers an enormous wealth of material to contrastive corpus linguists, translators and language learners, if we can exploit the data properly. Necessary preparation steps include sentence and word alignment across multiple languages. Additionally, linguistic annotation such as part-of-speech tagging, …
We present an interactive interface to explore the properties of intralingual and interlingual association measures. In conjunction, they can be employed for phraseme identification in word-aligned parallel corpora. The customizable component we built to visualize individual results is capable of showing part-of-speech tags, syntactic dependency relations …
We present a data-driven approach which exploits word alignment in a large parallel corpus with the objective of identifying those verb- and adjective-preposition combinations which are difficult for L2 language learners. This allows us, on the one hand, to provide language-specific ranked lists to help learners focus on particularly challenging …
The purpose of this paper is to describe a modular framework for text mining that uses Canonical Text Service (CTS) as a data source. By combining standardized functionalities with standardized access to text data, this framework intends to reduce the heterogeneity of workflows in today’s Digital Humanities and act as an important element of a text research …