We present our ongoing work on handling spelling variations in Old Swedish texts, which lack a standardized orthography. Words in the texts are matched to lexica by edit distance. We compare manually compiled substitution rules with rules automatically derived from spelling variants in a lexicon. A pilot evaluation showed that the second approach gives …
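The matching idea in this abstract can be illustrated with a minimal sketch: a Levenshtein-style edit distance in which certain character substitutions are cheaper than others. The rule table, the costs, the example words, and the function names below are all hypothetical and chosen for illustration; they are not taken from the paper, whose rules may well span more than one character.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical substitution rules: (text char, lexicon char) -> cost.
# Illustrative only; not the rules used in the paper.
RULES: Dict[Tuple[str, str], float] = {
    ("u", "v"): 0.1,
    ("v", "u"): 0.1,
    ("i", "j"): 0.1,
    ("j", "i"): 0.1,
}


def weighted_edit_distance(
    a: str, b: str, rules: Dict[Tuple[str, str], float] = RULES
) -> float:
    """Edit distance where substitutions listed in `rules` are discounted.

    Insertions, deletions, and unlisted substitutions cost 1.0.
    """
    # prev[j] is the cost of turning the processed prefix of `a` into b[:j].
    prev: List[float] = [float(j) for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr: List[float] = [float(i)]
        for j, cb in enumerate(b, start=1):
            sub = 0.0 if ca == cb else rules.get((ca, cb), 1.0)
            curr.append(
                min(
                    prev[j] + 1.0,       # delete ca
                    curr[j - 1] + 1.0,   # insert cb
                    prev[j - 1] + sub,   # substitute or match
                )
            )
        prev = curr
    return prev[-1]


def best_matches(
    token: str, lexicon: Iterable[str], max_cost: float = 1.0
) -> List[Tuple[float, str]]:
    """Return lexicon entries within `max_cost` of `token`, cheapest first."""
    scored = [(weighted_edit_distance(token, entry), entry) for entry in lexicon]
    return sorted((cost, entry) for cost, entry in scored if cost <= max_cost)


if __name__ == "__main__":
    # Toy lexicon; the entries are placeholders, not real lexicon data.
    print(best_matches("uara", ["vara", "fara", "hava"]))
```

With a rule table like this, a variant that differs from a lexicon entry only by a listed substitution ranks ahead of entries that need an unrelated edit, which is the effect substitution rules are meant to have.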
This paper details the design of the lexical and syntactic layers of a new annotated corpus of Swedish contemporary texts. In order to make the corpus adaptable into a variety of representations, the annotation is of a hybrid type with head-marked constituents and function-labeled edges, and with a rich annotation of non-local dependencies. The source …
We describe the word sense annotation layer in Eukalyptus, a freely available five-domain corpus of contemporary Swedish with several annotation layers. The annotation uses the SALDO lexicon to define the sense inventory, and allows word sense annotation of compound segments and multiword units. We give an overview of the new annotation tool developed for …
In this paper we describe and evaluate a tool for paradigm induction and lexicon extraction that has been applied to Old Swedish. The tool is semi-supervised and uses a small seed lexicon and unannotated corpora to derive full inflection tables for input lemmata. In the work presented here, the tool has been modified to deal with the rich spelling variation …
While corpus linguistics has a long tradition of extensive empirical studies, such work is scarcer in historical linguistics. During the last couple of years there has been an increased interest in the potential of language technology tools for disciplines other than modern linguistic research, for instance social, educational and historical studies …
Despite many years of research on Swedish language technology, there is still no well-documented standard for Swedish word processing covering the whole spectrum from low-level tokenization to morphological analysis and disambiguation. SWORD is a new initiative within the SWE-CLARIN consortium aiming to develop documented standards for Swedish word …
We present results on part-of-speech and morphological tagging for Old Swedish (1225–1526). In a set of experiments we look at the difference between within-corpus and across-corpus accuracy, and explore ways of mitigating the effects of variation and data sparseness by adding different types of dictionary information. Combining several methods, together …