Analysis of the Penn Korean Universal Dependency Treebank (PKT-UD): Manual Revision to Build Robust Parsing Model in Korean

Tae Hwan Oh, Ji Yoon Han, Hyonsu Choe, Seokwon Park, Han He, Jinho D. Choi, Na-Rae Han, Jena D. Hwang, Hansaem Kim
In this paper, we first raise important issues regarding the Penn Korean Universal Dependency Treebank (PKT-UD) and address them by manually revising the entire corpus, with the aim of producing cleaner UD annotations that are more faithful to Korean grammar. For compatibility with the rest of the UD corpora, we follow the UDv2 guidelines and extensively revise the part-of-speech tags and dependency relations to reflect the morphological features and flexible word-order aspects of Korean. The…
Annotation Issues in Universal Dependencies for Korean and Japanese
To investigate issues that arise in the process of developing a Universal Dependency (UD) treebank for Korean and Japanese, we begin by addressing the typological characteristics of Korean and Japanese.
KLUE: Korean Language Understanding Evaluation
KLUE is a collection of eight Korean natural language understanding tasks: Topic Classification, Semantic Textual Similarity, Natural Language Inference, Named Entity Recognition, Relation Extraction, Dependency Parsing, Machine Reading Comprehension, and Dialogue State Tracking. Comprehensive documentation on how KLUE was created is provided to facilitate building similar resources for other languages in the future.
K-SNACS: Annotating Korean Adposition Semantics
The SNACS framework is applied to Korean, annotating the popular novella The Little Prince with semantic supersense labels over all Korean postpositions; a detailed analysis of the corpus provides an apples-to-apples comparison between the Korean and English annotations.
Open Korean Corpora: A Practical Report
This work curates and reviews a list of Korean corpora, first describing institution-level resource development, and proposes a direction on how open-source dataset construction and releases should be done for less-resourced languages to promote research.
Penn Korean Treebank : Development and Evaluation
Some issues in developing the annotation guidelines for POS tagging and syntactic bracketing are discussed, including various methods used to detect and correct annotation errors, and some statistics on the size of the corpus are presented.
Building Universal Dependency Treebanks in Korean
This paper presents three treebanks in Korean that consist of dependency trees derived from existing treebanks: the Google UD Treebank, the Penn Korean Treebank, and the KAIST Treebank.
Universal Dependencies Version 2 for Japanese
The UD Japanese resources are built based on automatic conversion from several treebanks, and the word delimitation, POS, and syntactic relations of the existing treebanks are ported for the UD annotation scheme.
Universal Dependency Annotation for Multilingual Parsing
A new collection of treebanks with homogeneous syntactic dependency annotation for six languages: German, English, Swedish, Spanish, French and Korean is presented, made freely available in order to facilitate research on multilingual dependency parsing.
Establishing Strong Baselines for the New Decade: Sequence Tagging, Syntactic and Semantic Parsing with BERT
The BERT models outperform the previously best-performing models by 2.5% on average, and an in-depth analysis of the impact of BERT embeddings is provided using self-attention, which aids understanding of these rich representations.
Deep Biaffine Attention for Neural Dependency Parsing
This paper uses a larger but more thoroughly regularized parser than other recent BiLSTM-based approaches, with biaffine classifiers to predict arcs and labels, and shows which hyperparameter choices had a significant effect on parsing accuracy, allowing it to achieve large gains over other graph-based approaches.
CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies
The task and evaluation methodology are defined, the preparation of the data sets is described, the main results are reported and analyzed, and a brief categorization of the participating systems' approaches is provided.
SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing
SentencePiece, a language-independent subword tokenizer and detokenizer designed for neural text processing, shows that it is possible to achieve accuracy comparable to direct subword training from raw sentences.
Coordinate Structures in Universal Dependencies for Head-final Languages
The status in the current Japanese and Korean corpora is described and alternative designs suitable for these languages are proposed.
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.