• Corpus ID: 67855634

Chinese-Japanese Unsupervised Neural Machine Translation Using Sub-character Level Information

Longtu Zhang, Mamoru Komachi
Unsupervised neural machine translation (UNMT) requires only monolingual data of similar language pairs during training and can produce bi-directional translation models with relatively good performance on alphabetic languages (Lample et al., 2018). However, no research has been done on logographic language pairs. This study focuses on Chinese-Japanese UNMT trained on data containing sub-character (ideograph or stroke) level information, which is decomposed from character-level data. BLEU scores… 
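To make the idea of sub-character decomposition concrete, here is a minimal Python sketch. It is not the authors' pipeline: the table `TOY_IDEOGRAPHS` and the function `decompose` are hypothetical names introduced only for illustration, and the three entries are a hand-picked toy sample rather than a real decomposition resource.

```python
# Toy character-to-ideograph table (hypothetical, for illustration only).
# A real system would load a full ideograph- or stroke-level resource.
TOY_IDEOGRAPHS = {
    "好": ["女", "子"],
    "明": ["日", "月"],
    "林": ["木", "木"],
}


def decompose(text, table=TOY_IDEOGRAPHS):
    """Replace each character with its sub-character units when known.

    Characters missing from the table are passed through unchanged.
    """
    units = []
    for ch in text:
        units.extend(table.get(ch, [ch]))
    return units


print(decompose("明林"))  # ['日', '月', '木', '木']
```

Under this assumption, characters that share components map to overlapping token sequences, which is the kind of shared sub-character information the abstract refers to.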
Inference-only sub-character decomposition improves translation of unseen logographic characters
This work finds that complete sub-character decomposition often harms the translation of unseen characters and gives inconsistent results in general. It offers a simple alternative based on decomposition before inference for unseen characters only, which allows flexible application, improves translation adequacy, and requires no additional models or training.
Korean-to-Japanese Neural Machine Translation System using Hanja Information
This paper proposes a novel method to train a Korean-to-Japanese translation model that focuses on the vocabulary overlap of Korean Hanja words and Japanese Kanji words, along with strategies to leverage Hanja information.
UnihanLM: Coarse-to-Fine Chinese-Japanese Language Model Pretraining with the Unihan Database
UnihanLM, a self-supervised Chinese-Japanese pretrained masked language model (MLM) with a novel two-stage coarse-to-fine training approach, is proposed, shedding light on a new path to exploit the homology of languages.
Training neural machine translation (NMT) models requires a large parallel corpus, which is scarce for many language pairs. However, raw non-parallel corpora are often easy to obtain.
Mirror-Generative Neural Machine Translation
The proposed mirror-generative NMT (MGNMT), a single unified architecture that simultaneously integrates the source-to-target translation model, the target-to-source translation model, and two language models, consistently outperforms existing approaches in a variety of scenarios and language pairs, including resource-rich and low-resource languages.
Variational multimodal machine translation with underlying semantic alignment
The proposed variational multimodal translation model is designed as multitask learning in which the shared semantic representation for different modes is learned and the gap among semantic representation from various modes is reduced by incorporating additional constraints.
FSPRM: A Feature Subsequence Based Probability Representation Model for Chinese Word Embedding
A Feature Subsequence based Probability Representation Model (FSPRM) is proposed for learning Chinese word embeddings, in which the morphological and phonetic features of Chinese characters are integrated and their relevance is considered by designing a feature subsequence.
Hierarchical Character Embeddings: Learning Phonological and Semantic Representations in Languages of Logographic Origin Using Recursive Neural Networks
It is hypothesized that modeling logographs’ structures using a recursive neural network should be beneficial, and diagnostic analysis suggests that hierarchical embeddings constructed using treeLSTM are less sensitive to distractors and thus more robust, especially on complex logographs.


Neural Machine Translation of Logographic Language Using Sub-character Level Information
This study uses a simple approach to improve the performance of NMT systems by utilizing decomposed sub-character level information for logographic languages. Results indicate that this approach not only improves the translation capabilities of NMT systems between Chinese and English, but also further improves NMT systems between Chinese and Japanese.
Improving Neural Machine Translation Models with Monolingual Data
This work pairs monolingual training data with an automatic back-translation so that it can be treated as additional parallel training data, and obtains substantial improvements on the WMT 15 task English↔German and on the low-resourced IWSLT 14 task Turkish->English.
Phrase-Based & Neural Unsupervised Machine Translation
This work investigates how to learn to translate when having access to only large monolingual corpora in each language, and proposes two model variants, a neural and a phrase-based model, which are significantly better than methods from the literature, while being simpler and having fewer hyper-parameters.
Unsupervised Neural Machine Translation
This work proposes a novel method to train an NMT system in a completely unsupervised manner, relying on nothing but monolingual corpora. The model is a slightly modified attentional encoder-decoder that can be trained on monolingual corpora alone using a combination of denoising and backtranslation.
Radical Embedding: Delving Deeper to Chinese Radicals
This work proposes a new deep learning technique, called “radical embedding”, with justifications based on Chinese linguistics, and proves its feasibility and utility through a set of three experiments: two in-house standard experiments on short-text categorization and Chinese word segmentation and one in-field experiment on search ranking.
A Challenge Set Approach to Evaluating Machine Translation
This work presents an English-French challenge set approach to translation evaluation and error analysis, and uses it to analyze phrase-based and neural systems, providing a more fine-grained picture of the strengths of neural systems.
Utilizing Visual Forms of Japanese Characters for Neural Review Classification
A novel method is proposed that exploits visual information of ideograms and logograms in analyzing Japanese review documents by first converting font images of Japanese characters into character embeddings using convolutional neural networks and then constructing documents based on Hierarchical Attention Networks.
Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation
GNMT, Google's Neural Machine Translation system, is presented, which attempts to address many of the weaknesses of conventional phrase-based translation systems and provides a good balance between the flexibility of "character"-delimited models and the efficiency of "word"-delimited models.
ASPEC: Asian Scientific Paper Excerpt Corpus
The details of ASPEC (Asian Scientific Paper Excerpt Corpus), the first large-size parallel corpus in the scientific paper domain, are described.
Radical-Based Hierarchical Embeddings for Chinese Sentiment Analysis at Sentence Level
It is proved that radical-level processing could greatly improve sentiment classification performance and two types of Chinese radical-based hierarchical embeddings are proposed that incorporate not only semantics at radical and character level, but also sentiment information.