Code-Switched Language Models Using Neural Based Synthetic Data from Parallel Sentences

Genta Indra Winata, Andrea Madotto, Chien-Sheng Wu, Pascale Fung
Training code-switched language models is difficult due to the lack of data and the complexity of the grammatical structure. Linguistic constraint theories have been used for decades to generate artificial code-switching sentences to cope with this issue. However, they require external word alignments or constituency parsers, which can produce erroneous results on distant languages. We propose a sequence-to-sequence model using a copy mechanism to generate code-switching data by leveraging parallel sentences.
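The abstract sketches a sequence-to-sequence generator with a copy mechanism. As a rough illustration of the copying idea (this is not the paper's implementation; the function names and the uniform toy inputs below are invented), a single decoding step can mix the decoder's vocabulary distribution with an attention-based copy distribution over the source sentence:

```python
import math

def softmax(scores):
    """Convert raw scores into a probability distribution."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def copy_step(vocab_logits, attn_scores, src_token_ids, p_gen):
    """One decoding step of a pointer/copy mechanism: mix the decoder's
    vocabulary distribution (weight p_gen) with an attention-based copy
    distribution over the source tokens (weight 1 - p_gen)."""
    p_vocab = softmax(vocab_logits)   # distribution over the output vocabulary
    p_copy = softmax(attn_scores)     # distribution over source positions
    final = [p_gen * p for p in p_vocab]
    for pos, tok in enumerate(src_token_ids):
        # scatter copy probability mass onto the vocabulary ids of source words
        final[tok] += (1.0 - p_gen) * p_copy[pos]
    return final

# toy step: 5-word vocabulary, 3-token source sentence with ids [2, 0, 2]
dist = copy_step([0.0] * 5, [0.0] * 3, [2, 0, 2], p_gen=0.6)
```

A high `p_gen` favors free generation from the vocabulary, while a low `p_gen` favors copying source words, which is how spans of the parallel sentence can be carried verbatim into the synthetic code-switched output.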


Data Augmentation for Code-Switch Language Modeling by Fusing Multiple Text Generation Methods
To deal with the problem of data scarcity in training language models (LMs) for code-switching (CS) speech recognition, this work proposes an approach to obtain augmentation texts from three different text generation methods.
From Machine Translation to Code-Switching: Generating High-Quality Code-Switched Text
This work adapts a state-of-the-art neural machine translation model to generate Hindi-English code-switched sentences starting from monolingual Hindi sentences to show significant reductions in perplexity on a language modeling task, compared to using text from other generative models of CS text.
A Semi-supervised Approach to Generate the Code-Mixed Text using Pre-trained Encoder and Transfer Learning
This work proposes an effective deep learning approach for automatically generating code-mixed text from English to multiple languages without any parallel data, and transfers knowledge from a neural machine translation model to warm-start the training of the code-mixed generator.
Investigating Lexical Replacements for Arabic-English Code-Switched Data Augmentation
This paper investigates data augmentation techniques for synthesizing Dialectal Arabic-English CS text using parallel corpora and alignments, where CS points are either randomly chosen or learnt using a sequence-to-sequence model, and finds that the random-based approach outperforms the trained predictive models on all extrinsic tasks.
Call Larisa Ivanovna: Code-Switching Fools Multilingual NLU Models
It is reported that the state-of-the-art NLU models are unable to handle code-switching, and it is shown that the closer the languages are, the better the NLU model handles their alternation.
Intrinsic evaluation of language models for code-switching
This paper questions the assumption that only the ground-truth sentence is correct, observing that alternatively generated sentences are often linguistically valid when they differ from the ground truth by only one edit, and shows that using multilingual BERT achieves better performance than previous work on two code-switching data sets.
Meta-Transfer Learning for Code-Switched Speech Recognition
This work proposes a new learning method, meta-transfer learning, to transfer learn on a code-switched speech recognition system in a low-resource setting by judiciously extracting information from high-resource monolingual datasets.
Exploring Text-to-Text Transformers for English to Hinglish Machine Translation with Synthetic Code-Mixing
A dependency-free method for generating code-mixed texts from bilingual distributed representations that is competitive with (and in some cases is even superior to) several standard methods under a diverse set of conditions.
IITP-MT at CALCS2021: English to Hinglish Neural Machine Translation using Unsupervised Synthetic Code-Mixed Parallel Corpus
A neural machine translation (NMT) system which is trained on a synthetic code-mixed (CM) English-Hinglish parallel corpus and achieves 10.09 BLEU points on the given test set.
Code-Switching Text Augmentation for Multilingual Speech Processing
This work proposes a methodology to augment the monolingual data for artificially generating spoken CS text to improve different speech modules based on Equivalence Constraint theory while exploiting aligned translation pairs, to generate grammatically valid CS content.


Recurrent neural network language modeling for code switching conversational speech
This paper proposes a structure of recurrent neural networks to predict code-switches based on textual features with focus on Part-of-Speech tags and trigger words and extends the networks by adding POS information to the input layer and by factorizing the output layer into languages.
Code-Switch Language Model with Inversion Constraints for Mixed Language Speech Recognition
This work proposes a first-ever code-switch language model for mixed-language speech recognition that incorporates syntactic constraints via a code-switch boundary prediction model, a code-switch translation model, and a reconstruction model, and is more robust than previous approaches.
Language Modeling with Functional Head Constraint for Code Switching Speech Recognition
This paper proposes to learn the code mixing language model from bilingual data with this constraint in a weighted finite state transducer (WFST) framework and obtains a constrained code switch language model by first expanding the search network with a translation model, and then using parsing to restrict paths to those permissible under the constraint.
Code-Switching Language Modeling using Syntax-Aware Multi-Task Learning
This paper introduces a multi-task learning based language model which shares the syntax representation of languages to leverage linguistic information and tackle the low-resource data issue.
Code-switched Language Models Using Dual RNNs and Same-Source Pretraining
A novel recurrent neural network unit with dual components that focus on each language in the code-switched text separately and Pretraining the LM using synthetic text from a generative model estimated using the training data is proposed.
Combination of Recurrent Neural Networks and Factored Language Models for Code-Switching Language Modeling
A way to integrate part-of-speech tags (POS) and language information (LID) into these models which leads to significant improvements in terms of perplexity; it is shown that recurrent neural networks and factored language models can be combined using linear interpolation to achieve the best performance.
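The interpolation mentioned here is the standard linear mixture of the two models' next-word distributions, P(w) = λ·P_rnn(w) + (1 − λ)·P_flm(w). A minimal sketch (the function name and toy distributions are illustrative, not from that paper):

```python
def interpolate(p_rnn, p_flm, lam=0.5):
    """Linearly interpolate two next-word distributions given as dicts:
    P(w) = lam * P_rnn(w) + (1 - lam) * P_flm(w)."""
    vocab = set(p_rnn) | set(p_flm)
    return {w: lam * p_rnn.get(w, 0.0) + (1 - lam) * p_flm.get(w, 0.0)
            for w in vocab}

# toy distributions over a three-word vocabulary
mixed = interpolate({"a": 0.7, "b": 0.3},
                    {"a": 0.2, "b": 0.5, "c": 0.3},
                    lam=0.6)
```

The mixture weight λ is typically tuned on held-out data to minimize perplexity; since both inputs are probability distributions, the output is one as well.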
Curriculum Design for Code-switching: Experiments with Language Identification and Language Modeling with Deep Neural Networks
This study shows that irrespective of the task or the underlying DNN architecture, the best curriculum for training the code-switched models is to first train a network with monolingual training instances, where each mini-batch has instances from both languages, and then train the resulting network on code-switched data.
Language Modeling for Code-Mixing: The Role of Linguistic Theory based Synthetic Data
A computational technique for creation of grammatically valid artificial CM data based on the Equivalence Constraint Theory is presented and it is shown that when training examples are sampled appropriately from this synthetic data and presented in certain order, it can significantly reduce the perplexity of an RNN-based language model.
Syllable-Based Sequence-to-Sequence Speech Recognition with the Transformer in Mandarin Chinese
Sequence-to-sequence attention-based models have recently shown very promising results on automatic speech recognition (ASR) tasks, integrating the acoustic, pronunciation, and language models into a single neural network.
Bilingual Character Representation for Efficiently Addressing Out-of-Vocabulary Words in Code-Switching Named Entity Recognition
We propose an LSTM-based model with a hierarchical architecture for named entity recognition on code-switching Twitter data. Our model uses bilingual character representation and transfer learning to address out-of-vocabulary words.