Resources for Turkish Dependency Parsing: Introducing the BOUN Treebank and the BoAT Annotation Tool

Utku Türk, Furkan Atmaca, Saziye Betül Özates, Gözde Berk, Seyyit Talha Bedir, Abdullatif Köksal, Balkiz Öztürk Basaran, Tunga Güngör, Arzucan Özgür
In this paper, we describe our contributions and efforts to develop Turkish resources, which include a new treebank (BOUN Treebank) with novel sentences, along with the guidelines we adopted and a new annotation tool we developed (BoAT). The manual annotation process we employed was shaped and implemented by a team of four linguists and five NLP specialists. Decisions regarding the annotation of the BOUN Treebank were made in line with the Universal Dependencies framework, which originated from… 
Universals of Linguistic Idiosyncrasy in Multilingual Computational Linguistics (Dagstuhl Seminar 21351)
The Dagstuhl Seminar 21351 addressed the challenges of Linguistic Idiosyncrasy in Multilingual Computational Linguistics by creating synergies between three distinct though partly overlapping communities: experts in typology, in cross-lingual morphosyntactic annotation and in multiword expressions.
A Language-aware Approach to Code-switched Morphological Tagging
An approach for integrating language IDs into a transformer-based framework for CS morphological tagging is presented, and experimental results show that including language IDs in the learning model significantly improves accuracy over other approaches.
Automatic Lexical Simplification for Turkish
In this paper, we present the first automatic lexical simplification system for the Turkish language. Recent text simplification efforts rely on manually crafted simplified corpora and comprehensive
Massive Choice, Ample Tasks (MaChAmp): A Toolkit for Multi-task Learning in NLP
MaChAmp, a toolkit for easy fine-tuning of contextualized embeddings in multi-task settings, is presented; its benefits are its flexible configuration options and its support for a variety of natural language processing tasks in a uniform toolkit.


Turkish Treebanking: Unifying and Constructing Efforts
It is demonstrated that the annotation of the TNC-UD improves the parsing accuracy of Turkish, and a custom annotation software with advanced filtering and morphological editing options is constructed.
Improving the Annotations in the Turkish Universal Dependency Treebank
It is observed that the re-annotation of the Turkish IMST-UD treebank improves performance with regard to dependency parsing.
IMST: A Revisited Turkish Dependency Treebank
An attempt at reannotating the treebank from the ground up using the proposed schemes is described, and the consistencies of the two versions of the original treebank are compared via cross-validation using a dependency parser.
A Gold Standard Dependency Treebank for Turkish
A new treebank for Turkish, consisting of web and Wikipedia sentences annotated for segmentation, morphology, part-of-speech and dependency relations, is presented, along with the results of baseline experiments on Turkish dependency parsing with this treebank.
Universal Dependencies for Turkish
The findings suggest that the UD framework is at least as viable for Turkish as the original annotation framework of the IMST Treebank.
Swedish-Turkish Parallel Treebank
The treebank is a balanced, syntactically annotated corpus containing both fiction and technical documents. It was developed within a project supporting a research environment for minor languages, with the aim of creating representative language resources for language pairs dissimilar in language structure.
The English-Swedish-Turkish Parallel Treebank
A syntactically annotated parallel corpus of typologically partly different languages, namely English, Swedish and Turkish, is described; it is used in teaching and linguistic research to study the relationship between the structurally different languages.
The TIGER Treebank
This paper reports on the TIGER Treebank, a corpus of currently 35.000 syntactically annotated German newspaper sentences. We describe what kind of information is encoded in the treebank and
Constructing a Turkish Constituency Parse TreeBank
The words are semi-automatically annotated morphologically and a rule-based approach is used for refining the parse trees based on the morphological analyses of the words.