Corpus ID: 24551110

The CLIN27 Shared Task: Translating Historical Text to Contemporary Language for Improving Automatic Linguistic Annotation

@inproceedings{Sang2017TheCS,
  title={The CLIN27 Shared Task: Translating Historical Text to Contemporary Language for Improving Automatic Linguistic Annotation},
  author={Erik {Tjong Kim Sang} and Marcel Bollmann and Remko Boschker and Francisco Casacuberta and Feike Dietz and Stefanie Dipper and Miguel Domingo and Rob van der Goot and Marjo van Koppen and Nikola Ljube{\v{s}}i{\'c} and Robert {\"O}stling and Florian Petran and Eva Pettersson and Yves Scherrer and Marijn Schraagen and Leen Sevens and J{\"o}rg Tiedemann and Tom Vanallemeersch and Kalliopi Zervanou},
  year={2017}
}
The CLIN27 shared task evaluates the effect of translating historical text to modern text with the goal of improving the quality of the output of contemporary natural language processing tools appl ... 


A Machine Translation Approach for Modernizing Historical Documents Using Backtranslation
TLDR
In this work, several machine translation approaches for modernizing historical documents are proposed and tested in different scenarios, obtaining very encouraging results.
Automatic Phrase Recognition in Historical German
TLDR
The evaluation shows that the unlexicalized parser outperforms the sequence labeling approach, achieving F1-scores of 87%–91% on modern German and between 73% and 85% on different historical corpora.
How to tag non-standard language: Normalisation versus domain adaptation for Slovene historical and user-generated texts
TLDR
This paper investigates the two approaches, normalisation and domain adaptation, on carefully constructed data sets encompassing historical and user-generated Slovene texts, in particular focusing on the amount of labour necessary to produce the manually annotated data sets for each approach and comparing the resulting PoS accuracy.
Neural text normalization with adapted decoding and POS features
TLDR
A novel solution for normalizing Swiss German WhatsApp messages using the encoder–decoder neural machine translation (NMT) framework is proposed, enhancing the performance of a plain character-level NMT model with the integration of a word-level language model and linguistic features in the form of part-of-speech (POS) tags.
Modernizing Historical Documents: a User Study
An Interactive Machine Translation Framework for Modernizing Historical Documents
TLDR
This work proposes a collaborative framework in which a scholar can work together with the machine to generate the new version of a historical document, written in the modern version of the document's original language.
Normalization with Adapted Decoding and PoS Features
  • Computer Science
  • 2018
TLDR
This paper proposes a novel solution for normalizing Swiss German WhatsApp messages using the encoder-decoder neural machine translation (NMT) framework and enhances the performance of a plain character-level NMT model with the integration of a word-level language model and linguistic features (POS tags).
Time-Out: Temporal Referencing for Robust Modeling of Lexical Semantic Change
TLDR
This work empirically tested the Temporal Referencing method for lexical semantic change and showed that, trained on a diachronic corpus, the skip-gram with negative sampling architecture with temporal referencing outperforms alignment models on a synthetic task as well as on a manual test set.
The Janes project: language resources and tools for Slovene user generated content
TLDR
The paper presents the results of the Janes project, which aimed to develop language resources and tools for Slovene user generated content, which include a tokeniser, word-normaliser, part-of-speech tagger and lemmatiser, and a named entity recogniser.
Latin-Spanish Neural Machine Translation: from the Bible to Saint Augustine
TLDR
This paper builds a Transformer-based Machine Translation system on the Bible parallel corpus and builds a comparable corpus from Saint Augustine texts and their translations to study the domain adaptation case from the Bible texts to Saint Augustine’s works.

References

Improving Part-of-Speech Tagging of Historical Text by First Translating to Modern Text
TLDR
This work tests several methods for translating the words in the historical text to modern equivalents before applying the tag assignment tools, and shows that this additional translation step improves the quality of the automatic syntactic analysis.
Spelling Normalisation and Linguistic Analysis of Historical Text for Information Extraction
TLDR
A large number of historical texts are not available in an electronic format, and even if they are, they are unlikely to be suitable for use in an e-book format.
Re-evaluating the Role of Bleu in Machine Translation Research
TLDR
It is shown that an improved Bleu score is neither necessary nor sufficient for achieving an actual improvement in translation quality, and two significant counterexamples to Bleu’s correlation with human judgments of quality are given.
Bleu: a Method for Automatic Evaluation of Machine Translation
TLDR
This work proposes a method of automatic machine translation evaluation that is quick, inexpensive, and language-independent, that correlates highly with human evaluation, and that has little marginal cost per run.
Manual and semi-automatic normalization of historical spelling - case studies from Early New High German
TLDR
Norma is presented, a semi-automatic normalization tool that integrates different modules (lexicon lookup, rewrite rules) for normalizing words in an interactive way and dynamically updates the set of rule entries, given new input.
Billions of Parallel Words for Free: Building and Using the EU Bookshop Corpus
TLDR
The EU Bookshop, an online service and archive of publications from various European institutions, is described; it contains a large body of publications in the 24 official languages of the EU and is used in training SMT models for English, French, German, Spanish, and Latvian.
Parsing early and late modern English corpora
TLDR
The automatic annotation of diachronic corpora at the levels of word-class, lemma, chunks, and dependency syntax is described, evaluated, and improved, showing that despite high noise levels linguistic signals clearly emerge, opening new possibilities for large-scale research of gradient phenomena in language change.
Part-of-Speech Tagging for Historical English
TLDR
It is demonstrated that the Feature Embedding method for unsupervised domain adaptation outperforms word embeddings and Brown clusters, showing the importance of embedding the entire feature space, rather than just individual words.
Guidelines for normalising Early Modern English corpora: Decisions and justifications
TLDR
It is argued that it is important to develop a linguistically meaningful rationale to achieve good results from the normalisation and standardisation process and propose a number of guidelines for normalising corpora.
Normalization of Dutch User-Generated Content
TLDR
A phrase-based machine translation approach to normalize Dutch user-generated content (UGC) using a cascaded SMT system where a token-based module is followed by a translation at the character level gives the best word error rate reduction.