• Corpus ID: 239009613

Scribosermo: Fast Speech-to-Text models for German and other Languages

Daniel Bermuth, Alexander Poeppel, Wolfgang Reif
Recent Speech-to-Text models often require a large amount of hardware resources and are mostly trained in English. This paper presents Speech-to-Text models for German, as well as for Spanish and French, with special features: (a) They are small and run in real-time on microcontrollers like a Raspberry Pi. (b) Using a pretrained English model, they can be trained on consumer-grade hardware with a relatively small dataset. (c) The models are competitive with other solutions and outperform them in…
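Point (b) describes cross-language transfer: reuse the pretrained English weights wherever the architectures match and re-initialize only the layers that differ, typically the output layer, since the German alphabet adds symbols such as umlauts and ß. A minimal sketch of that idea follows; the layer names and sizes are hypothetical, not taken from the paper.

```python
import numpy as np

def transfer_weights(pretrained, target_shapes, rng):
    """Copy pretrained tensors whose shapes match the target network;
    re-initialize the rest (e.g. the output layer for a new alphabet)."""
    transferred = {}
    for name, shape in target_shapes.items():
        weights = pretrained.get(name)
        if weights is not None and weights.shape == shape:
            transferred[name] = weights.copy()                      # reuse English weights
        else:
            transferred[name] = rng.normal(0.0, 0.02, size=shape)   # fresh layer
    return transferred

rng = np.random.default_rng(0)
# Hypothetical English model: 29 output symbols (a-z, space, apostrophe, blank).
english = {
    "encoder.conv1": rng.normal(size=(256, 64)),
    "decoder.out": rng.normal(size=(256, 29)),
}
# A German alphabet with umlauts and ß needs a wider output layer, e.g. 33 symbols.
german_shapes = {"encoder.conv1": (256, 64), "decoder.out": (256, 33)}

german = transfer_weights(english, german_shapes, rng)
```

Only the mismatched output layer starts from scratch, which is why a relatively small German dataset suffices for fine-tuning.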

Building Large-Vocabulary ASR Systems for Languages Without Any Audio Training Data
This paper presents a way to build an ASR system for a language even in the absence of any audio training data in that language, by simply re-using an existing acoustic model from a phonologically similar language, without any kind of modification or adaptation towards the target language.
Multilingual MLP features for low-resource LVCSR systems
We introduce a new approach to training multilayer perceptrons (MLPs) for large vocabulary continuous speech recognition (LVCSR) in new languages which have only a few hours of annotated in-domain data.
Librispeech: An ASR corpus based on public domain audio books
It is shown that acoustic models trained on LibriSpeech give a lower error rate on the Wall Street Journal (WSJ) test sets than models trained on WSJ itself.
Common Voice: A Massively-Multilingual Speech Corpus
This work presents speech recognition experiments using Mozilla's DeepSpeech Speech-to-Text toolkit and finds an average Character Error Rate improvement across twelve target languages; for most of these languages, these are the first published results on end-to-end Automatic Speech Recognition.
VoxPopuli: A Large-Scale Multilingual Speech Corpus for Representation Learning, Semi-Supervised Learning and Interpretation
We introduce VoxPopuli, a large-scale multilingual corpus providing 100K hours of unlabelled speech data in 23 languages. It is the largest open dataset to date for unsupervised representation learning.
IMS-Speech: A Speech to Text Tool
IMS-Speech is a web-based tool for German and English speech transcription, aiming to facilitate research in disciplines that require access to lexical information in spoken-language materials; it is freely available to academic researchers.
Streaming End-to-end Speech Recognition for Mobile Devices
This work describes efforts at building an E2E speech recognizer using a recurrent neural network transducer and finds that the proposed approach can outperform a conventional CTC-based model in terms of both latency and accuracy.
Multilingual training of deep neural networks
This work investigates multilingual modeling in the context of a DNN-hidden Markov model (HMM) hybrid, where the DNN outputs are used as the HMM state likelihoods, and proposes that training the hidden layers on multiple languages makes them more suitable for cross-lingual transfer.
CSS10: A Collection of Single Speaker Speech Datasets for 10 Languages
This paper describes the development of CSS10, a collection of single-speaker speech datasets for ten languages composed of short audio clips from LibriVox audiobooks and their aligned texts, and trains two neural text-to-speech models on it.
Quartznet: Deep Automatic Speech Recognition with 1D Time-Channel Separable Convolutions
A new end-to-end neural acoustic model for automatic speech recognition is proposed that achieves near state-of-the-art accuracy on LibriSpeech and Wall Street Journal while having fewer parameters than all competing models.