• Corpus ID: 239768221

Lhotse: a speech data representation library for the modern deep learning ecosystem

@article{elasko2021LhotseAS,
  title={Lhotse: a speech data representation library for the modern deep learning ecosystem},
  author={Piotr Żelasko and Daniel Povey and Jan Trmal and Sanjeev Khudanpur},
  journal={ArXiv},
  year={2021},
  volume={abs/2110.12561}
}
Speech data is notoriously difficult to work with due to a variety of codecs, lengths of recordings, and meta-data formats. We present Lhotse, a speech data representation library that draws upon lessons learned from the Kaldi speech recognition toolkit and brings its concepts into the modern deep learning ecosystem. Lhotse provides a common JSON description format with corresponding Python classes and data preparation recipes for over 30 popular speech corpora. Various datasets can be easily… 

References

Espresso: A Fast End-to-End Neural Speech Recognition Toolkit
TLDR
Espresso achieves state-of-the-art ASR performance on the WSJ, LibriSpeech, and Switchboard data sets among other end-to-end systems without data augmentation, and is 4-11x faster for decoding than similar systems (e.g. ESPnet).
Librispeech: An ASR corpus based on public domain audio books
TLDR
It is shown that acoustic models trained on LibriSpeech give a lower error rate on the Wall Street Journal (WSJ) test sets than models trained on WSJ itself.
Libri-Light: A Benchmark for ASR with Limited or No Supervision
  • Jacob Kahn, M. Rivière, +12 authors Emmanuel Dupoux
  • Computer Science, Engineering
    ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2020
TLDR
A new collection of spoken English audio suitable for training speech recognition systems under limited or no supervision, derived from open-source audio books from the LibriVox project, which is, to the authors' knowledge, the largest freely-available corpus of speech.
GigaSpeech: An Evolving, Multi-domain ASR Corpus with 10,000 Hours of Transcribed Audio
This paper introduces GigaSpeech, an evolving, multi-domain English speech recognition corpus with 10,000 hours of high quality labeled audio suitable for supervised training, and 40,000 hours of
MLS: A Large-Scale Multilingual Dataset for Speech Research
TLDR
This paper introduces the Multilingual LibriSpeech (MLS) dataset, a large multilingual corpus suitable for speech research; the authors believe such a large transcribed dataset will open new avenues in ASR and Text-To-Speech research.
The People's Speech: A Large-Scale Diverse English Speech Recognition Dataset for Commercial Usage
TLDR
The legal and ethical issues surrounding the creation of a sizable machine learning corpus and plans for continued maintenance of the project under MLCommons’s sponsorship are discussed.
ESPnet: End-to-End Speech Processing Toolkit
TLDR
The major architecture of this software platform, several important functionalities that differentiate ESPnet from other open source ASR toolkits, and experimental results on major ASR benchmarks are explained.
The Kaldi Speech Recognition Toolkit
TLDR
Describes the design of Kaldi, a free, open-source toolkit for speech recognition research that provides a speech recognition system based on finite-state automata together with detailed documentation and a comprehensive set of scripts for building complete recognition systems.
CHiME-6 Challenge: Tackling Multispeaker Speech Recognition for Unsegmented Recordings
TLDR
Of note, Track 2 is the first challenge activity in the community to tackle an unsegmented multispeaker speech recognition scenario with a complete set of reproducible open source baselines providing speech enhancement, speaker diarization, and speech recognition modules.
The AMI meeting corpus
TLDR
The corpus is being distributed using a web server designed to allow convenient browsing and download of multimedia content and associated annotations, as well as data collection, annotation and distribution.