Michael Poprat

We propose a pipelined system for the automatic classification of medical documents according to their language (English, Spanish and German) and their target user group (medical experts vs. health care consumers). We use a simple n-gram-based categorization model and present experimental results for both classification tasks. We also demonstrate how this …
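As a rough illustration of n-gram-based language categorization, the sketch below builds character trigram profiles per language and assigns a document to the profile with the highest cosine similarity. The abstract does not specify the model's details, so the trigram size, the cosine scoring, the function names, and the tiny training sentences are all assumptions made for this example, not the authors' method.

```python
from collections import Counter
import math


def char_ngrams(text, n=3):
    """Return the character n-grams of a lowercased, space-padded text."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def build_profile(samples, n=3):
    """Count n-grams over all training samples of one category."""
    counts = Counter()
    for sample in samples:
        counts.update(char_ngrams(sample, n))
    return counts


def cosine(a, b):
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(value * b[key] for key, value in a.items() if key in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify(text, profiles, n=3):
    """Pick the category whose profile is most similar to the text."""
    doc = Counter(char_ngrams(text, n))
    return max(profiles, key=lambda label: cosine(doc, profiles[label]))


# Hypothetical toy training data; a real system would use large corpora.
TRAIN = {
    "en": ["the patient was treated with antibiotics",
           "the heart and the lungs were examined"],
    "es": ["el paciente fue tratado con antibióticos",
           "el corazón y los pulmones fueron examinados"],
    "de": ["der Patient wurde mit Antibiotika behandelt",
           "das Herz und die Lunge wurden untersucht"],
}
PROFILES = {lang: build_profile(samples) for lang, samples in TRAIN.items()}

print(classify("the patient was examined", PROFILES))
```

The same profile-and-compare scheme could in principle be reused for the expert vs. consumer distinction by training one profile per target group instead of one per language, although word-level n-grams may suit that task better.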
In this paper, we describe our efforts to build on WORDNET resources, using WORDNET lexical data, the data format that it comes with and WORDNET's software infrastructure in order to generate a biomedical extension of WORDNET, the BIOWORDNET. We began our efforts on the assumption that the software resources were stable and reliable. In the course of our …
We provide an overview of corpus building efforts at the Jena University Language & Information Engineering (JULIE) Lab, which are focused on life science documents. Special emphasis is laid on semantic annotations in terms of a large number of biomedical named entities (almost 100 entity types), semantic relations, as well as discourse phenomena, reference …
Multilingual terminological resources can be drawn from parallel corpora in the languages of interest, possibly exploiting machine translation solutions for term identification. This, the main objective of the CLEF-ER challenge, involves parallel corpora in English and other languages. The challenge organisers have gathered and normalized documents from the …
We describe an experiment on automatically collecting large language- and topic-specific corpora by using a focused Web crawler. Our crawler combines efficient crawling techniques with a common text classification tool. Given a sample corpus of medical documents, we automatically extract query phrases and then acquire seed URLs with a standard search engine. …
We introduce an interlingua-based approach to cross-language information retrieval, in which queries, as well as documents, are mapped onto a language-independent concept layer on which retrieval operations are performed. This approach is contrasted with one which directly translates non-English queries (German and Portuguese, in our experiments) to English …
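The idea of retrieving on a language-independent concept layer can be sketched as follows: queries and documents are both mapped to sets of concept identifiers, and matching happens on those identifiers rather than on words. The lexicon, the concept identifiers (such as `C_HEART`), the whitespace tokenization, and the overlap-based ranking are all invented for this illustration and are not taken from the paper.

```python
# Hypothetical per-language lexicons mapping words to shared concept IDs.
LEXICON = {
    "en": {"heart": "C_HEART", "inflammation": "C_INFLAMMATION",
           "liver": "C_LIVER"},
    "de": {"herz": "C_HEART", "entzündung": "C_INFLAMMATION",
           "leber": "C_LIVER"},
    "pt": {"coração": "C_HEART", "inflamação": "C_INFLAMMATION",
           "fígado": "C_LIVER"},
}


def to_concepts(text, lang):
    """Map a text to the set of concept IDs found in its language's lexicon."""
    lexicon = LEXICON[lang]
    return {lexicon[tok] for tok in text.lower().split() if tok in lexicon}


def retrieve(query, query_lang, docs):
    """Rank documents by the number of concepts shared with the query."""
    query_concepts = to_concepts(query, query_lang)
    scored = []
    for doc_id, (lang, text) in docs.items():
        overlap = len(query_concepts & to_concepts(text, lang))
        if overlap:
            scored.append((overlap, doc_id))
    # Highest overlap first; ties broken by document ID for determinism.
    return [doc_id for _, doc_id in sorted(scored, key=lambda p: (-p[0], p[1]))]


# Hypothetical toy collection with documents in different languages.
DOCS = {
    "d1": ("en", "inflammation of the heart"),
    "d2": ("en", "liver function tests"),
    "d3": ("pt", "inflamação do fígado"),
}

print(retrieve("Herz Entzündung", "de", DOCS))
```

Because both sides are reduced to concept identifiers, a German query can match English and Portuguese documents without any query translation step, which is the contrast the abstract draws with direct translation to English.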
One way to exploit the CLEF-ER challenge results is to semi-automatically enrich the multilingual terminology provided to the CLEF-ER participants. In the current version, English is the predominant language (1.8 million synonyms in 531k concepts). Synonyms in other languages are clearly underrepresented (Spanish: 643k, French: 127k, German: 119k and Dutch: …