Preliminary Experiments on Unsupervised Word Discovery in Mboshi
@inproceedings{Godard2016PreliminaryEO,
  title={Preliminary Experiments on Unsupervised Word Discovery in Mboshi},
  author={Pierre Godard and Gilles Adda and Martine Adda-Decker and A. Allauzen and Laurent Besacier and H{\'e}l{\`e}ne Bonneau-Maynard and Guy-No{\"e}l Kouarata and Kevin L{\"o}ser and Annie Rialland and Fran{\c{c}}ois Yvon},
  booktitle={INTERSPEECH},
  year={2016}
}
The necessity to document thousands of endangered languages encourages the collaboration between linguists and computer scientists in order to provide the documentary linguistics community with the support of automatic processing tools. The French-German ANR-DFG project Breaking the Unwritten Language Barrier (BULB) aims at developing such tools for three mostly unwritten African languages of the Bantu family. For one of them, Mboshi, a language originating from the "Cuvette" region of the…
17 Citations
Parallel Corpora in Mboshi (Bantu C25, Congo-Brazzaville)
- Linguistics
- LREC
- 2018
This article presents multimodal and parallel data collections in Mboshi, as part of the French-German BULB project. It aims at supporting documentation and providing digital resources for less…
Investigating Language Impact in Bilingual Approaches for Computational Language Documentation
- Computer Science
- SLTU
- 2020
This paper uses the MaSS multilingual speech corpus for creating 56 bilingual pairs and suggests that incorporating boundary clues extracted from a non-parametric Bayesian model with the attentional word segmentation neural model from Godard et al. (2018) increases their translation and alignment quality, especially for challenging language pairs.
Local Word Discovery for Interactive Transcription
- Linguistics
- EMNLP
- 2021
Human expertise and the participation of speech communities are essential factors in the success of technologies for low-resource languages. Accordingly, we propose a new computational task which is…
Weakly Supervised Word Segmentation for Computational Language Documentation
- Computer Science, Linguistics
- ACL
- 2022
The experiments on two very low resource languages (Mboshi and Japhug), whose documentation is still in progress, show that weak supervision can be beneficial to the segmentation quality and open the way for interactive annotation tools for documentary linguists.
Unwritten languages demand attention too! Word discovery with encoder-decoder models
- Computer Science
- 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)
- 2017
Results show that it is possible to retrieve at least 27% of the gold standard vocabulary by training an encoder-decoder neural machine translation system with only 5,157 sentences, a result close to those obtained with a task-specific Bayesian nonparametric model.
Unsupervised Word Segmentation from Discrete Speech Units in Low-Resource Settings
- Computer Science
- ArXiv
- 2021
The results suggest that neural models for speech discretization are difficult to exploit in this setting, and that it might be necessary to adapt them to limit sequence length.
A small Griko-Italian speech translation corpus
- Computer Science, Linguistics
- SLTU
- 2018
This paper presents an extension to a very low-resource parallel corpus collected in an endangered language, Griko, making it useful for computational research, and illustrates its use on zero-resource tasks by presenting some baseline results for the task of speech-to-translation alignment and unsupervised word discovery.
A Very Low Resource Language Speech Corpus for Computational Language Documentation Experiments
- Computer Science, Linguistics
- LREC
- 2018
A speech corpus collected during a realistic language documentation process, made up of 5k speech utterances in Mboshi aligned to French text translations, is presented.
Spoken Term Discovery for Language Documentation using Translations
- Computer Science, Linguistics
- SCNLP@EMNLP 2017
- 2017
An unsupervised speech-to-translation alignment model is modified and prototype speech segments that match the translation words are obtained, which are in turn used to discover terms in the unlabelled data.
Sparse Transcription
- Computer Science
- Computational Linguistics
- 2021
Sparse transcription combines the older practice of word-level transcription with interpretive, iterative, and interactive processes that are amenable to wider participation and that open the way to new methods for processing oral languages.
References
Innovative technologies for under-resourced language documentation: The BULB Project
- Linguistics, Computer Science
- 2016
The project Breaking the Unwritten Language Barrier (BULB), which brings together linguists and computer scientists, aims at supporting linguists in documenting unwritten languages. In order to…
Word segmentation through cross-lingual word-to-phoneme alignment
- Linguistics, Computer Science
- 2012 IEEE Spoken Language Technology Workshop (SLT)
- 2012
It is shown that unsupervised learning of word segmentation is more accurate when information of another language is used, and that the new alignment model Model 3P for cross-lingual word-to-phoneme alignment outperforms a state-of-the-art monolingual word segmentation approach.
A Bayesian framework for word segmentation: Exploring the effects of context
- Computer Science
- Cognition
- 2009
Inducing Bilingual Lexicons from Small Quantities of Sentence-Aligned Phonemic Transcriptions
- Linguistics, Computer Science
- 2015
It is shown that monolingual and bilingual lexical entries can be learnt with high precision from corpora having just 1k–10k sentences, and why the application of alignment algorithms to the task of documenting endangered languages is important.
Pronunciation Extraction from Phoneme Sequences through Cross-Lingual Word-to-Phoneme Alignment
- Linguistics, Computer Science
- SLSP
- 2013
Analyzing 14 translations in 9 languages to build a dictionary for English shows that the quality of the resulting dictionary is better in case of close vocabulary sizes in source and target language, shorter sentences, more word repetitions, and formal equivalent translations.
Contextual Dependencies in Unsupervised Word Segmentation
- Computer Science
- ACL
- 2006
Two new Bayesian word segmentation methods are proposed that assume unigram and bigram models of word dependencies respectively, and the bigram model greatly outperforms the unigram model (and previous probabilistic models), demonstrating the importance of such dependencies for word segmentation.
Bayesian Semi-Supervised Chinese Word Segmentation for Statistical Machine Translation
- Computer Science
- COLING
- 2008
A Bayesian semi-supervised Chinese word segmentation model which uses both monolingual and bilingual information to derive a segmentation suitable for MT is proposed and improves a state-of-the-art MT system in a small and a large data environment.
Word segmentation and pronunciation extraction from phoneme sequences through cross-lingual word-to-phoneme alignment
- Linguistics, Computer Science
- Comput. Speech Lang.
- 2016
Les relatives possessives en mbochi (C25)
- Linguistics
- 2010
This paper deals with the possessive constructions -- either connective or relative -- in Mbochi (C25), a Bantu language spoken in Congo-Brazzaville. In Mbochi, as in most languages of the same group…