Preliminary Experiments on Unsupervised Word Discovery in Mboshi

Pierre Godard, Gilles Adda, Martine Adda-Decker, A. Allauzen, Laurent Besacier, Hélène Bonneau-Maynard, Guy-Noël Kouarata, Kevin Löser, Annie Rialland, François Yvon
The necessity of documenting thousands of endangered languages encourages collaboration between linguists and computer scientists, in order to provide the documentary linguistics community with automatic processing tools. The French-German ANR-DFG project Breaking the Unwritten Language Barrier (BULB) aims at developing such tools for three mostly unwritten African languages of the Bantu family. For one of them, Mboshi, a language originating from the "Cuvette" region of the…


Parallel Corpora in Mboshi (Bantu C25, Congo-Brazzaville)
This article presents multimodal and parallel data collections in Mboshi, as part of the French-German BULB project. It aims at supporting documentation and providing digital resources for less…
Investigating Language Impact in Bilingual Approaches for Computational Language Documentation
This paper uses the MaSS multilingual speech corpus to create 56 bilingual pairs and suggests that incorporating boundary clues extracted from a non-parametric Bayesian model into the attentional word segmentation neural model from Godard et al. (2018) increases their translation and alignment quality, especially for challenging language pairs.
Local Word Discovery for Interactive Transcription
Human expertise and the participation of speech communities are essential factors in the success of technologies for low-resource languages. Accordingly, we propose a new computational task which is…
Weakly Supervised Word Segmentation for Computational Language Documentation
The experiments on two very low resource languages (Mboshi and Japhug), whose documentation is still in progress, show that weak supervision can be beneficial to the segmentation quality and open the way for interactive annotation tools for documentary linguists.
Unwritten languages demand attention too! Word discovery with encoder-decoder models
Results show that it is possible to retrieve at least 27% of the gold standard vocabulary by training an encoder-decoder neural machine translation system with only 5,157 sentences, close to those obtained with a task-specific Bayesian nonparametric model.
Unsupervised Word Segmentation from Discrete Speech Units in Low-Resource Settings
The results suggest that neural models for speech discretization are difficult to exploit in this setting, and that it might be necessary to adapt them to limit sequence length.
A small Griko-Italian speech translation corpus
This paper presents an extension to a very low-resource parallel corpus collected in an endangered language, Griko, making it useful for computational research, and illustrates its use on zero-resource tasks by presenting some baseline results for the task of speech-to-translation alignment and unsupervised word discovery.
A Very Low Resource Language Speech Corpus for Computational Language Documentation Experiments
A speech corpus collected during a realistic language documentation process, made up of 5k speech utterances in Mboshi aligned to French text translations, is presented.
Spoken Term Discovery for Language Documentation using Translations
An unsupervised speech-to-translation alignment model is modified and prototype speech segments that match the translation words are obtained, which are in turn used to discover terms in the unlabelled data.
Sparse Transcription
  • Steven Bird, Computational Linguistics, 2021
Sparse transcription combines the older practice of word-level transcription with interpretive, iterative, and interactive processes that are amenable to wider participation and that open the way to new methods for processing oral languages.


Innovative technologies for under-resourced language documentation: The BULB Project
The project Breaking the Unwritten Language Barrier (BULB), which brings together linguists and computer scientists, aims at supporting linguists in documenting unwritten languages. In order to…
Breaking the Unwritten Language Barrier: The BULB Project
Word segmentation through cross-lingual word-to-phoneme alignment
It is shown that unsupervised learning of word segmentation is more accurate when information from another language is used, and that the new alignment model Model 3P for cross-lingual word-to-phoneme alignment outperforms a state-of-the-art monolingual word segmentation approach.
Inducing Bilingual Lexicons from Small Quantities of Sentence-Aligned Phonemic Transcriptions
It is shown that monolingual and bilingual lexical entries can be learnt with high precision from corpora of just 1k–10k sentences, and why the application of alignment algorithms to the task of documenting endangered languages is important.
Pronunciation Extraction from Phoneme Sequences through Cross-Lingual Word-to-Phoneme Alignment
Analyzing 14 translations in 9 languages to build a dictionary for English shows that the quality of the resulting dictionary is better in case of close vocabulary sizes in source and target language, shorter sentences, more word repetitions, and formal equivalent translations.
Contextual Dependencies in Unsupervised Word Segmentation
Two new Bayesian word segmentation methods are proposed that assume unigram and bigram models of word dependencies respectively; the bigram model greatly outperforms the unigram model (and previous probabilistic models), demonstrating the importance of such dependencies for word segmentation.
Bayesian Semi-Supervised Chinese Word Segmentation for Statistical Machine Translation
A Bayesian semi-supervised Chinese word segmentation model which uses both monolingual and bilingual information to derive a segmentation suitable for MT is proposed and improves a state-of-the-art MT system in a small and a large data environment.
Les relatives possessives en mbochi (C25) [Possessive relatives in Mbochi (C25)]
This paper deals with the possessive constructions -- either connective or relative -- in Mbochi (C25), a Bantu language spoken in Congo-Brazzaville. In Mbochi, as in most languages of the same group…