We describe an experiment on automatically collecting large language- and topic-specific corpora using a focused Web crawler. Our crawler combines efficient crawling techniques with a common text classification tool. Given a sample corpus of medical documents, we automatically extract query phrases and then acquire seed URLs with a standard search engine. …
Multilingual terminological resources can be drawn from parallel corpora in the languages of interest, possibly exploiting machine translation solutions for term identification. The main objective of the CLEF-ER challenge involves parallel corpora in English and other languages. The challenge organisers have gathered and normalized documents from the …
We propose a pipelined system for the automatic classification of medical documents according to their language (English, Spanish and German) and their target user group (medical experts vs. health care consumers). We use a simple n-gram-based categorization model and present experimental results for both classification tasks. We also demonstrate how this …
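The abstract above does not specify which n-gram model is used. As a rough illustration only, the following sketch assumes a rank-based character n-gram profile with an "out-of-place" distance, a common choice for language identification. The function names, the `top_k` cutoff, and the tiny training sentences are all hypothetical and are not taken from the paper.

```python
from collections import Counter

def ngram_profile(text, n_max=3, top_k=300):
    """Map the most frequent character n-grams (n = 1..n_max) to their rank."""
    # Normalize whitespace and pad so word boundaries become part of n-grams.
    padded = " " + " ".join(text.lower().split()) + " "
    counts = Counter(
        padded[i:i + n]
        for n in range(1, n_max + 1)
        for i in range(len(padded) - n + 1)
    )
    ranked = [gram for gram, _ in counts.most_common(top_k)]
    return {gram: rank for rank, gram in enumerate(ranked)}

def out_of_place(doc_profile, cat_profile):
    """Sum of rank differences; n-grams missing from the category get a fixed penalty."""
    max_penalty = len(cat_profile)
    return sum(
        abs(rank - cat_profile[gram]) if gram in cat_profile else max_penalty
        for gram, rank in doc_profile.items()
    )

def classify(text, category_profiles):
    """Return the category whose profile is closest to the document profile."""
    doc_profile = ngram_profile(text)
    return min(
        category_profiles,
        key=lambda cat: out_of_place(doc_profile, category_profiles[cat]),
    )

# Hypothetical toy training data; a real system would use a large corpus per class.
TRAINING = {
    "English": "the patient was treated with antibiotics and the infection of the lung was cured",
    "German": "der Patient wurde mit Antibiotika behandelt und die Infektion der Lunge wurde geheilt",
    "Spanish": "el paciente fue tratado con antibióticos y la infección del pulmón fue curada",
}
PROFILES = {lang: ngram_profile(text) for lang, text in TRAINING.items()}

if __name__ == "__main__":
    print(classify("die Behandlung der Infektion", PROFILES))
```

The same `classify` function could in principle serve the second task (expert vs. consumer texts) by training one profile per target group instead of per language, although word-level n-grams may be a better fit there.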
In this paper, we describe our efforts to build on WORDNET resources, using WORDNET lexical data, the data format that it comes with and WORDNET's software infrastructure in order to generate a biomedical extension of WORDNET, the BIOWORDNET. We began our efforts on the assumption that the software resources were stable and reliable. In the course of our …
We provide an overview of corpus building efforts at the Jena University Language & Information Engineering (JULIE) Lab, which are focused on life science documents. Special emphasis is laid on semantic annotations in terms of a large number of biomedical named entities (almost 100 entity types), semantic relations, as well as discourse phenomena, reference …
We introduce an interlingua-based approach to cross-language information retrieval, in which queries, as well as documents, are mapped onto a language-independent concept layer on which retrieval operations are performed. This approach is contrasted with one which directly translates non-English queries (German and Portuguese, in our experiments) to English …
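The idea of retrieving on a language-independent concept layer can be illustrated with a minimal sketch. It assumes a per-language lexicon that maps surface words to shared concept identifiers and ranks documents by Jaccard overlap of concept sets. The lexicon entries, the `C_…` identifiers, and the scoring function are hypothetical stand-ins, not the method or resources described in the abstract.

```python
# Hypothetical per-language lexicons mapping words to shared concept identifiers.
LEXICON = {
    "de": {"herz": "C_HEART", "entzündung": "C_INFLAMMATION"},
    "pt": {"coração": "C_HEART", "inflamação": "C_INFLAMMATION"},
    "en": {"heart": "C_HEART", "inflammation": "C_INFLAMMATION", "kidney": "C_KIDNEY"},
}

def to_concepts(text, lang):
    """Map a text to the set of concept identifiers found in its language's lexicon."""
    lexicon = LEXICON[lang]
    return {lexicon[token] for token in text.lower().split() if token in lexicon}

def rank(query, query_lang, documents):
    """Rank documents by Jaccard overlap between query and document concept sets."""
    query_concepts = to_concepts(query, query_lang)
    scored = []
    for doc_id, (text, lang) in documents.items():
        doc_concepts = to_concepts(text, lang)
        union = query_concepts | doc_concepts
        score = len(query_concepts & doc_concepts) / len(union) if union else 0.0
        scored.append((score, doc_id))
    # Highest score first; ties broken by document id for a stable order.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [doc_id for score, doc_id in scored if score > 0]

DOCUMENTS = {
    "d1": ("inflammation of the heart", "en"),
    "d2": ("kidney inflammation", "en"),
    "d3": ("the weather", "en"),
}

if __name__ == "__main__":
    # A German query is matched against English documents without translating it.
    print(rank("herz entzündung", "de", DOCUMENTS))
```

Because both sides are compared only as concept sets, no query translation step is needed. The contrasting approach mentioned in the abstract would instead translate the query into English and then match on words.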
Up until now, crucial life science information resources, whether bibliographic or factual databases, have been isolated from each other. Moreover, semantic metadata intended to structure their contents is supplied only manually. In the StemNet project we aim at developing a framework for semantic interoperability for these resources. This will …