Grzegorz Chrupala

Learn More
Tokenization is widely regarded as a solved problem due to the high accuracy that rulebased tokenizers achieve. But rule-based tokenizers are hard to maintain and their rules language specific. We show that highaccuracy word and sentence segmentation can be achieved by using supervised sequence labeling on the character level combined with unsupervised(More)
Lemmatization for languages with rich inflectional morphology is one of the basic, indispensable steps in a language processing pipeline. In this paper we present a simple data-driven context-sensitive approach to lemmatizating word forms in running text. We treat lemmatization as a classification task for Machine Learning, and automatically induce class(More)
Named Entity Recognition is a relatively well-understood NLP task, with many publicly available training resources and software for English. Other languages tend to be underserved in this area. For German, CoNLL-2003 provides training data, but there are no publicly available, ready-to-use tools. We fill this gap and develop a German NER system with(More)
For the slot filling task of TAC KBP 2010 we developed as a system a simple pipeline architecture whose main components are a two-stage retrieval module and a relation extraction module. We use word-cluster features in the system as a method of achieving generalization by exploiting raw text. In the relation extraction module we use distant supervision in(More)
We describe a system for the CoNLL-2004 Shared Task on Semantic Role Labeling (Carreras and Màrquez, 2004a). The system implements a two-layer learning architecture to recognize arguments in a sentence and predict the role they play in the propositions. The exploration strategy visits possible arguments bottom-up, navigating through the clause hierarchy.(More)
We present a visually grounded model of speech perception which projects spoken utterances and images to a joint semantic space. We use a multi-layer recurrent highway network to model the temporal nature of spoken speech, and show that it learns to extract both form and meaningbased linguistic knowledge from the input signal. We carry out an in-depth(More)