• Corpus ID: 240070889

Empirical Analysis of Korean Public AI Hub Parallel Corpora and in-depth Analysis using LIWC

Chanjun Park, Midan Shim, Sugyeong Eo, Seolhwa Lee, Jaehyung Seo, Hyeonseok Moon, Heuiseok Lim
A machine translation (MT) system aims to translate a source language into a target language. Recent studies on MT systems mainly focus on neural machine translation (NMT). One factor that significantly affects the performance of NMT is the availability of high-quality parallel corpora. However, high-quality parallel corpora for Korean are relatively scarce compared to those for other high-resource languages, such as German or Italian. To address this problem, AI Hub recently…
3 Citations
Policy 2.0 in Ecuador. Analysis of discourse and political communication on Facebook
  • Sociology
  • 2022
Introduction: Social networks are recognized for their impact on users’ decisions. They have been studied to determine their incidence in the political arena. Methodology: This research analyzes the
Política 2.0 en Ecuador. Análisis del discurso y la comunicación política en Facebook
Introduction: Social networks are recognized for their impact on users' decisions, and they have been studied to determine their influence in the political sphere. Methodology: The present
BERTOEIC: Solving TOEIC Problems Using Simple and Efficient Data Augmentation Techniques with Pretrained Transformer Encoders
This work attributes the difficulty of deep learning research on TOEIC to data scarcity, proposes two data augmentation methods to improve the model in a low-resource environment, and confirms the importance of understanding semantics and grammar in TOEIC.


Copied Monolingual Data Improves Low-Resource Neural Machine Translation
We train a neural machine translation (NMT) system to both translate source-language text and copy target-language text, thereby exploiting monolingual corpora in the target language. Specifically, we
Should we find another model?: Improving Neural Machine Translation Performance with ONE-Piece Tokenization Method without Model Modification
Through comparative experiments with all the tokenization methods currently used in NLP research, ONE-Piece achieves performance comparable to the current Korean-English machine translation state-of-the-art model.
Using Monolingual Data in Neural Machine Translation: a Systematic Study
This paper conducts a systematic study of back-translation, comparing alternative uses of monolingual data as well as multiple data generation procedures, and introduces new data simulation techniques that are almost as effective, yet much cheaper to implement.
Improving Low-Resource Neural Machine Translation with Filtered Pseudo-Parallel Corpus
To improve machine translation performance with low-resource language pairs, this work proposes a method to expand the training data effectively via filtering the pseudo-parallel corpus using a quality estimation based on back-translation.
Improving Back-Translation with Iterative Filtering and Data Selection for Sinhala-English NMT
This work employs iterative back-translation (BT), filtering, and data selection for extremely low-resource, domain-specific Sinhala-English translation to improve NMT performance, and shows that combining these techniques yields an even better result.
Ancient Korean Neural Machine Translation
This paper proposes the first ancient Korean neural machine translation model using a Transformer that can improve the efficiency of a translator by quickly providing a draft translation for a number of untranslated ancient documents.
Parallel Corpus Filtering via Pre-trained Language Models
A novel approach that filters noisy sentence pairs out of web-crawled corpora with pre-trained language models, using a Generative Pre-training (GPT) language model as a domain filter to balance data domains, and achieves a new state of the art.
Exploring the Data Efficiency of Cross-Lingual Post-Training in Pretrained Language Models
Quantitative results from intrinsic and extrinsic evaluations show that the novel cross-lingual post-training approach outperforms several massively multilingual and monolingual pretrained language models in most settings and improves the data efficiency by a factor of up to 32 compared to monolingual training.
Cross-lingual Language Model Pretraining
This work proposes two methods to learn cross-lingual language models (XLMs): one unsupervised that relies only on monolingual data, and one supervised that leverages parallel data with a new cross-lingual language model objective.
A Program for Aligning Sentences in Bilingual Corpora
This paper describes a method and a program for aligning sentences based on a simple statistical model of character lengths, which uses the fact that longer sentences in one language tend to be translated into longer sentences in the other language, and that shorter sentences tend to be translated into shorter sentences.