Metrics for Modeling Code-Switching Across Corpora

@inproceedings{Guzmn2017MetricsFM,
  title={Metrics for Modeling Code-Switching Across Corpora},
  author={Gualberto A. Guzm{\'a}n and Joseph Ricard and Jacqueline Serigos and Barbara E. Bullock and Almeida Jacqueline Toribio},
  booktitle={INTERSPEECH},
  year={2017}
}
In developing technologies for code-switched speech, it would be desirable to be able to predict how much language mixing might be expected in the signal and the regularity with which it might occur. [...] Applying these metrics to corpora of different languages and genres, we find that they display distinct probabilities and periodicities of switching, information useful for speech processing of mixed-language data.
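The metrics in question quantify both how much a corpus mixes its languages and how regularly switches occur. As a minimal sketch, the following Python functions compute such quantities from a token-level language-tag sequence, assuming the standard definitions of the M-index, I-index, burstiness, and memory used in this line of work; the function names and the toy Spanish-English tag list are illustrative, not taken from the authors' code.

from statistics import mean, pstdev

def spans(tags):
    """Lengths of maximal same-language runs in a token-level language-tag sequence."""
    runs, count = [], 1
    for prev, curr in zip(tags, tags[1:]):
        if curr == prev:
            count += 1
        else:
            runs.append(count)
            count = 1
    runs.append(count)
    return runs

def m_index(tags):
    """Multilingual index: 0 for monolingual text, 1 for perfectly balanced languages."""
    n, langs = len(tags), set(tags)
    k = len(langs)
    if k < 2:
        return 0.0
    p2 = sum((tags.count(lang) / n) ** 2 for lang in langs)
    return (1 - p2) / ((k - 1) * p2)

def i_index(tags):
    """Integration index: fraction of adjacent token pairs at which the language switches."""
    switches = sum(1 for prev, curr in zip(tags, tags[1:]) if curr != prev)
    return switches / (len(tags) - 1)

def burstiness(tags):
    """Burstiness of same-language span lengths: -1 = periodic switching, 0 = random, +1 = bursty."""
    runs = spans(tags)
    m, s = mean(runs), pstdev(runs)
    return (s - m) / (s + m) if (s + m) > 0 else 0.0

def memory(tags):
    """Correlation between consecutive span lengths (do short spans tend to follow short spans?)."""
    runs = spans(tags)
    if len(runs) < 3:
        return 0.0
    a, b = runs[:-1], runs[1:]
    ma, mb = mean(a), mean(b)
    sa, sb = pstdev(a), pstdev(b)
    if sa == 0 or sb == 0:
        return 0.0
    cov = mean((x - ma) * (y - mb) for x, y in zip(a, b))
    return cov / (sa * sb)

# Toy example: a short Spanish-English utterance tagged per token (illustrative only).
tags = ["spa", "spa", "eng", "eng", "eng", "spa", "eng", "eng", "spa", "spa"]
print(m_index(tags), i_index(tags), burstiness(tags), memory(tags))

Under these conventional definitions, an I-index near 1 indicates switching at nearly every token boundary, while burstiness separates corpora whose same-language spans are close to uniform in length (values toward -1) from those in which switching arrives in clusters (values toward +1); memory captures whether span lengths are correlated from one span to the next.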

Citations

A Survey of Code-switched Speech and Language Processing
This survey reviews computational approaches for code-switched speech and natural language processing, including language processing tools and end-to-end systems, and concludes with future directions and open problems in the field.
Automatic Detection of Code-switching Style from Acoustics
It is hypothesized that it may be useful for an ASR system to be able to first detect the switching style of a particular utterance from acoustics, and then use specialized language models or other adaptation techniques for decoding the speech.
Challenges and Limitations with the Metrics Measuring the Complexity of Code-Mixed Text
This paper demonstrates several inherent limitations of code-mixing metrics, with examples from existing datasets that are widely used across various experiments.
Language Informed Modeling of Code-Switched Text
It is hypothesized that encoding language information strengthens a language model by helping it learn code-switching points, and it is demonstrated that the highest-performing model achieves a test perplexity of 19.52 on the CS corpus that was collected and processed.
Exploiting Monolingual Speech Corpora for Code-Mixed Speech Recognition
This paper extracts segments from monolingual data and concatenates them to form code-mixed utterances such that these probability distributions are preserved, and shows significant improvements in Hindi-English code-mixed ASR performance compared to using synthetic speech naively constructed from complete utterances in different languages.
Predicting the presence of a Matrix Language in code-switching
The results demonstrate that the model can separate some corpora according to whether they have a dominant ML or not, but that the corpora span a range of mixing types that cannot be sorted neatly into an insertional vs. alternational dichotomy.
Detecting de minimis Code-Switching in Historical German Books
This paper examines the mixture of languages in the Deutsches Textarchiv (DTA), a corpus of 1406 primarily German books from the 17th to 19th centuries, and addresses the practical task of predicting code-switching from features of the matrix language alone in the DTA corpus.
GLUECoS: An Evaluation Benchmark for Code-Switched NLP
This work presents an evaluation benchmark, GLUECoS, for code-switched languages that spans several NLP tasks in English-Hindi and English-Spanish, and shows that in most tasks, across both language pairs, multilingual models fine-tuned on code-switched data perform best, indicating that multilingual models can be further optimized for code-switching tasks.
Should Code-switching Models Be Asymmetric?
The results show that the same constraints on the grammatical junctures and on the directionality of switching hold irrespective of the symmetry of the data, which suggests that insertional C-S may be subsumed under alternational C-S, as spontaneous borrowing.

References

Showing 10 of 41 references.
Learning to Predict Code-Switching Points
Exploratory results on learning to predict potential code-switching points in Spanish-English are presented, using a transcription of code-switched discourse to evaluate the performance of the classifiers.
Challenges of Computational Processing of Code-Switching
This paper addresses challenges of Natural Language Processing on non-canonical multilingual data in which two or more languages are mixed, by highlighting and discussing the key problems for each of the tasks with supporting examples from different language pairs and relevant previous work.
Comparing the Level of Code-Switching in Corpora
The paper addresses the issues of evaluation and comparison that these new corpora entail, by defining an objective measure of corpus-level complexity for code-switched texts and showing how this formal measure can be used in practice, applying it to several code-switched corpora.
Part-of-Speech Tagging for English-Spanish Code-Switched Text
Results on part-of-speech tagging of Spanish-English code-switched discourse are presented, and different approaches to exploit existing resources for both languages are explored, ranging from simple heuristics, to language identification, to machine learning.
Features for factored language models for code-switching speech
It is found that Brown word clusters, part-of-speech tags and open-class words are most effective at reducing the perplexity of factored language models on the Mandarin-English code-switching corpus SEAME.
Simple Tools for Exploring Variation in Code-switching for Linguists
The aim of this paper is to quantify and visualize the nature of the integration of languages in CS documents using simple language-independent metrics that can be adopted by linguists.
Speech Synthesis of Code-Mixed Text
It is found that there is a significant user preference for TTS systems that can correctly identify and pronounce words in different languages, and a preliminary framework for synthesizing code-mixed text is described.
Syntactic and Semantic Features For Code-Switching Factored Language Models
The experimental results reveal that Brown word clusters, part-of-speech tags and open-class words are the most effective at reducing the perplexity of factored language models on the Mandarin-English code-switching corpus SEAME.
An Investigation of Code-Switching Attitude Dependent Language Modeling
This paper investigates the adaptation of language modeling for conversational Mandarin-English code-switching speech and its effect on speech recognition performance, applying recurrent neural network language models which integrate POS information into the input layer and factorize the output layer into languages for modeling CS.
Moving Code-Switching Research toward More Empirically Grounded Methods
This paper offers an automated language identification system and intuitive metrics (the Integration, Burstiness, and Memory indices) that allow us to characterize how corpora are mixed.