Marrying Universal Dependencies and Universal Morphology

@inproceedings{McCarthy2018MarryingUD,
  title={Marrying Universal Dependencies and Universal Morphology},
  author={Arya D. McCarthy and Miikka Silfverberg and Ryan Cotterell and Mans Hulden and David Yarowsky},
  booktitle={UDW@EMNLP},
  year={2018}
}
The Universal Dependencies (UD) and Universal Morphology (UniMorph) projects each present schemata for annotating the morphosyntactic details of language. Each project also provides corpora of annotated text in many languages—UD at the token level and UniMorph at the type level. As each corpus is built by different annotators, language-specific decisions hinder the goal of universal schemata. With compatibility of tags, each project’s annotations could be used to validate the other’s… 
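
To make the idea of tag compatibility concrete, the following is a minimal sketch of turning a UD-style feature string into a UniMorph-style tag string. The mapping table `UD_TO_UNIMORPH` and the function `ud_feats_to_unimorph` are hypothetical names introduced here for illustration, and the table is a tiny hand-picked subset. It is not the mapping from the paper.

```python
# Illustrative subset only: this table is an assumption made for the example,
# not the conversion table described in the paper.
UD_TO_UNIMORPH = {
    ("Number", "Sing"): "SG",
    ("Number", "Plur"): "PL",
    ("Tense", "Past"): "PST",
    ("Tense", "Pres"): "PRS",
    ("Person", "1"): "1",
    ("Person", "2"): "2",
    ("Person", "3"): "3",
}


def ud_feats_to_unimorph(feats: str) -> str:
    """Convert a UD FEATS string such as 'Number=Plur|Tense=Past' into a
    semicolon-separated UniMorph-style tag string.

    Feature/value pairs missing from the table are silently skipped.
    """
    # UD uses '_' for an empty feature set.
    if feats in ("", "_"):
        return ""
    tags = []
    for pair in feats.split("|"):
        name, _, value = pair.partition("=")
        tag = UD_TO_UNIMORPH.get((name, value))
        if tag is not None:
            tags.append(tag)
    return ";".join(tags)


print(ud_feats_to_unimorph("Number=Sing|Person=3|Tense=Pres"))
```

A real conversion would also need language-specific handling, since, as the abstract notes, annotators' language-specific decisions differ between the two projects.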

UniMorph 2.0: Universal Morphology
TLDR
Advances made to the schema, tooling, and dissemination of project resources since the UniMorph 2.0 release described at LREC 2018 are detailed.
UniMorph 3.0: Universal Morphology
TLDR
Advances made to the schema, tooling, and dissemination of project resources since the UniMorph 2.0 release described at LREC 2018 are detailed.
Variation in Universal Dependencies annotation: A token-based typological case study on adpossessive constructions
In this paper we present a method for identifying and analyzing adnominal possessive constructions in 66 Universal Dependencies treebanks. We classify adpossessive constructions in terms of their
Hebrewnette - A New Derivational Resource for Non-concatenative Morphology: Principles, Design and Implementation
TLDR
The architecture of Hebrewnette, a derivational database of Modern Hebrew (and more generally of Semitic languages), is presented, and the paper examines how its annotations allow theoretical hypotheses about non-concatenative morphology to be verified.
Lexical databases for computational analyses: A linguistic perspective
TLDR
Some of the methodological challenges and pitfalls involved in using corpora for typological research are surveyed, and best practices and directions for further research on the UniMorph database are proposed.
MorphyNet: a Large Multilingual Database of Derivational and Inflectional Morphology
TLDR
Extending the scope of state-of-the-art multilingual morphological databases, MorphyNet is announced, a high-quality resource with 15 languages, 519k derivational and 10.1M inflectional entries, and a rich set of morphological features.
Sigmorphon 2019 Task 2 system description paper: Morphological analysis in context for many languages, with supervision from only a few
TLDR
This paper presents the UNT HiLT+Ling system for the Sigmorphon 2019 shared Task 2: Morphological Analysis and Lemmatization in Context, which makes minimal use of the supplied training data, in order to be extensible to languages without labeled training data for the morphological inflection task.
Does BERT agree? Evaluating knowledge of structure dependence through agreement relations
TLDR
It is shown that both the single-language and multilingual BERT models capture syntax-sensitive agreement patterns well in general, and the specific linguistic contexts in which their performance degrades are highlighted.
Evaluating the Morphosyntactic Well-formedness of Generated Texts
TLDR
This paper proposes L’AMBRE, a metric that evaluates the morphosyntactic well-formedness of text using its dependency parse and the morphosyntactic rules of the language, and shows the effectiveness of the metric on machine translation through a diachronic study of systems translating into morphologically rich languages.
Cross-Lingual Lemmatization and Morphology Tagging with Two-Stage Multilingual BERT Fine-Tuning
  • D. Kondratyuk
  • Computer Science
    Proceedings of the 16th Workshop on Computational Research in Phonetics, Phonology, and Morphology
  • 2019
TLDR
The CHARLES-SAARLAND system achieves the highest average accuracy and F1 score in morphology tagging and places second in average lemmatization accuracy. It is also shown that, when paired with additional character-level and word-level LSTM layers, a second stage of fine-tuning on each treebank individually can improve evaluation results even further.

References

UniMorph 3.0: Universal Morphology
TLDR
Advances made to the schema, tooling, and dissemination of project resources since the UniMorph 2.0 release described at LREC 2018 are detailed.
Very-large Scale Parsing and Normalization of Wiktionary Morphological Paradigms
TLDR
The effort described here resulted in the extraction of a uniquely large normalized resource of nearly 1,000,000 inflectional paradigms across 350 languages, comparable in quantity and quality to data extracted using hand-tuned, language-specific approaches.
A Rich Morphological Tagger for English: Exploring the Cross-Linguistic Tradeoff Between Morphology and Syntax
TLDR
A tagger for English is trained that uses syntactic features obtained by automatic parsing to recover complex morphological tags projected from Czech. This provides quantitative confirmation of the underlying linguistic hypothesis of equal expressivity and bodes well for future improvements in downstream HLT tasks, including machine translation.
Token and Type Constraints for Cross-Lingual Part-of-Speech Tagging
TLDR
It is shown that additional token constraints can be projected from a resource-rich source language to a resource-poor target language via word-aligned bitext, and the applicability of coupled token and type constraints is demonstrated empirically across a diverse set of languages.
Neural Factor Graph Models for Cross-lingual Morphological Tagging
TLDR
This paper proposes a method for cross-lingual morphological tagging that aims to improve information sharing between languages by relaxing the assumption that tag sets exactly overlap between the high-resource language (HRL) and the low-resource language (LRL).
Tokenizing, POS Tagging, Lemmatizing and Parsing UD 2.0 with UDPipe
We present an update to UDPipe 1.0 (Straka et al., 2016), a trainable pipeline which performs sentence segmentation, tokenization, POS tagging, lemmatization and dependency parsing. We provide
A Language-Independent Feature Schema for Inflectional Morphology
TLDR
This schema is used to universalize data extracted from Wiktionary via a robust multidimensional table parsing algorithm and feature mapping algorithms, yielding 883,965 instantiated paradigms in 352 languages.
HamleDT: Harmonized multi-language dependency treebank
TLDR
It is claimed that transformation procedures can be designed to automatically identify most such phenomena and convert them to a unified annotation style, which is beneficial both to comparative corpus linguistics and to machine learning of syntactic parsing.
A Universal Part-of-Speech Tagset
TLDR
This work proposes a tagset that consists of twelve universal part-of-speech categories and develops a mapping from 25 different treebank tagsets to this universal set, which, when combined with the original treebank data, produces a dataset consisting of common parts of speech for 22 different languages.
CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies
TLDR
The task and evaluation methodology are defined, the preparation of the data sets is described, the main results are reported and analyzed, and a brief categorization of the different approaches of the participating systems is provided.