
Common Voice: A Massively-Multilingual Speech Corpus

@inproceedings{Ardila2020CommonVA,
  title={Common Voice: A Massively-Multilingual Speech Corpus},
  author={Rosana Ardila and Megan Branson and Kelly Davis and Michael Henretty and Michael Kohler and Josh Meyer and Reuben Morais and Lindsay Saunders and Francis M. Tyers and Gregor Weber},
  booktitle={LREC},
  year={2020}
}
The Common Voice corpus is a massively-multilingual collection of transcribed speech intended for speech technology research and development. Common Voice is designed for Automatic Speech Recognition purposes but can be useful in other domains (e.g. language identification). To achieve scale and sustainability, the Common Voice project employs crowdsourcing for both data collection and data validation. The most recent release includes 29 languages, and as of November 2019 there are a total of… 
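For readers who want to try the corpus directly, the following is a minimal sketch of loading one language split. It assumes the Hugging Face datasets library and its community-maintained mozilla-foundation/common_voice_11_0 loader; neither is part of the original paper, and recent releases require accepting the dataset terms on the Hugging Face Hub.

# Minimal sketch (assumptions noted above): iterate over a few validated
# Turkish clips from a Common Voice release hosted on the Hugging Face Hub.
from datasets import load_dataset

cv = load_dataset("mozilla-foundation/common_voice_11_0", "tr", split="validation")

for clip in cv.select(range(3)):
    audio = clip["audio"]  # dict with "array" (waveform) and "sampling_rate"
    print(clip["sentence"], audio["sampling_rate"], len(audio["array"]))

Here "tr" (Turkish) is only an example language code; any released language works the same way.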

Citations

RyanSpeech: A Corpus for Conversational Text-to-Speech Synthesis
TLDR
The corpus's design and pipeline make RyanSpeech ideal for developing TTS systems for real-world applications; to provide a baseline for future research, protocols, and benchmarks, four state-of-the-art speech models and a vocoder are trained on it.
The People's Speech: A Large-Scale Diverse English Speech Recognition Dataset for Commercial Usage
TLDR
The legal and ethical issues surrounding the creation of a sizable machine learning corpus, and plans for continued maintenance of the project under MLCommons's sponsorship, are discussed.
Automatic Speech Recognition Datasets in Cantonese Language: A Survey and a New Dataset
TLDR
This paper creates a powerful and robust Cantonese ASR model by applying multi-dataset learning on MDCC and Common Voice zh-HK, and analyzes the existing datasets according to their speech type, data source, total size and availability.
English Accent Accuracy Analysis in a State-of-the-Art Automatic Speech Recognition System
TLDR
This work evaluates a state-of-the-art deep-learning-based automatic speech recognition (ASR) model on unseen data from a corpus with a wide variety of labeled English accents from countries around the world, testing its accuracy against samples extracted from another continuously growing public corpus, the Common Voice dataset (CV).
Improving Language Identification of Accented Speech
TLDR
It is shown that for speech with a non-native or regional accent, the accuracy of spoken language identification systems drops dramatically, and that the accuracy of identifying the language is inversely correlated with the strength of the accent.
Automatic Speech Recognition Datasets in Cantonese: A Survey and New Dataset
TLDR
A powerful and robust Cantonese ASR model is created by applying multi-dataset learning on MDCC and Common Voice zh-HK and the results show the effectiveness of the dataset.
Common Phone: A Multilingual Dataset for Robust Acoustic Modelling
TLDR
This work introduces Common Phone, a gender-balanced, multilingual corpus recorded from more than 11,000 contributors to Mozilla's Common Voice project, and concludes that Common Phone provides sufficient variability and reliable phonetic annotation to help bridge the gap between research and application of acoustic models.
SpeechStew: Simply Mix All Available Speech Recognition Data to Train One Large Neural Network
TLDR
SpeechStew is a speech recognition model that is trained on a combination of various publicly available speech recognition datasets: AMI, Broadcast News, Common Voice, LibriSpeech, Switchboard/Fisher, Tedlium, and Wall Street Journal, and it is demonstrated that SpeechStew learns powerful transfer learning representations.
BEA-Base: A Benchmark for ASR of Spontaneous Hungarian
TLDR
BEA-Base, a subset of the BEA spoken Hungarian database comprising mostly spontaneous speech of 140 speakers, is introduced, built specifically to assess ASR, primarily for conversational AI applications.
Multilingual Spoken Words Corpus
TLDR
This work generates the dataset by applying forced alignment to crowdsourced sentence-level audio to produce per-word timing estimates for extraction, and reports baseline accuracy metrics for keyword-spotting models trained on this dataset compared to models trained on a manually recorded keyword dataset.
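Given such forced-alignment output, per-word extraction reduces to slicing the sentence-level waveform at the estimated word boundaries. The following is a minimal sketch, not the paper's actual pipeline: the alignment triples and file names are hypothetical placeholders, and only the soundfile library is assumed.

import soundfile as sf

# Hypothetical per-word timing estimates (word, start_sec, end_sec), e.g.
# parsed from a forced aligner's output; the values are placeholders.
alignment = [("hello", 0.42, 0.88), ("world", 0.95, 1.40)]

audio, sr = sf.read("clip.wav")  # crowdsourced sentence-level recording
for word, start, end in alignment:
    segment = audio[int(start * sr):int(end * sr)]  # slice by sample index
    sf.write(f"{word}.wav", segment, sr)  # one keyword clip per word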
