Rule-Based Normalization of Historical Texts
TLDR
An unsupervised, rule-based approach which maps historical wordforms to modern wordforms through context-aware rewrite rules that apply to sequences of characters derived from two aligned versions of the Luther Bible.
CorA: A web-based annotation tool for historical and other non-standard language data
We present CorA, a web-based annotation tool for manual annotation of historical and other non-standard language data. It allows for editing the primary data and modifying token boundaries during the
The CLIN27 Shared Task: Translating Historical Text to Contemporary Language for Improving Automatic Linguistic Annotation
The CLIN27 shared task evaluates the effect of translating historical text to modern text with the goal of improving the quality of the output of contemporary natural language processing tools appl
Applying Rule-Based Normalization to Different Types of Historical Texts - An Evaluation
TLDR
An unsupervised, rule-based approach which maps historical wordforms to modern wordforms by means of context-aware rewrite rules that apply to sequences of characters derived from two aligned versions of the Luther Bible.
Manual and semi-automatic normalization of historical spelling - case studies from Early New High German
TLDR
Presents Norma, a semi-automatic normalization tool that integrates different modules (lexicon lookup, rewrite rules) to normalize words interactively and dynamically updates its set of rule entries given new input.
Studies for Segmentation of Historical Texts: Sentences or Chunks?
TLDR
This work uses a machine learning approach based on Conditional Random Fields to label tokens with their relative positions in text segments, and finds that the task gets easier the finer-grained the target segments are.
ReM: A reference corpus of Middle High German - corpus compilation, annotation, and access
TLDR
The ReM project builds on several earlier annotation efforts to produce a reference corpus for Middle High German, which consists of around two million tokens and provides a mostly complete collection of written records from Early Middle High German as well as a selection of Middle High German texts from 1200 to 1350.
Aligning the Un-Alignable - A Pilot Study Using a Noisy Corpus of Nonstandardized, Semi-parallel Texts
TLDR
Presents a robust, precision-oriented alignment method for a corpus of comparable texts without standardized spelling or sentence boundary marking; the method is found to outperform the competing one by a great margin.
Geographical Visualization of Search Results in Historical Corpora
TLDR
ANNISVis, a web app for comparative visualization of the geographical distribution of linguistic data, is presented together with a sample deployment for a corpus of Middle High German texts; it allows the user to formulate multiple ad-hoc queries and visualizes them on a map.
Evaluating Inter-Annotator Agreement on Historical Spelling Normalization
TLDR
A new method for measuring inter-annotator agreement on the normalization task that integrates common chance-corrected agreement measures, such as Fleiss's κ or Krippendorff's α; its novelty lies in the way the annotated word forms are treated.