SPGISpeech: 5,000 hours of transcribed financial audio for fully formatted end-to-end speech recognition

@inproceedings{ONeill2021SPGISpeech50,
  title={SPGISpeech: 5,000 hours of transcribed financial audio for fully formatted end-to-end speech recognition},
  author={Patrick K. O’Neill and Vitaly Lavrukhin and Somshubra Majumdar and Vahid Noroozi and Yuekai Zhang and Oleksii Kuchaiev and Jagadeesh Balam and Yuliya Dovzhenko and Keenan Freyberg and Michael D. Shulman and Boris Ginsburg and Shinji Watanabe and Georg Kucsko},
  booktitle={Interspeech},
  year={2021}
}
In the English speech-to-text (STT) machine learning task, acoustic models are conventionally trained on uncased Latin characters, and any necessary orthography (such as capitalization, punctuation, and denormalization of non-standard words) is imputed by separate post-processing models. This adds complexity and limits performance, as many formatting tasks benefit from semantic information present in the acoustic signal but absent in transcription. Here we propose a new STT task: end-to-end…
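
To make the task concrete, here is a purely illustrative Python sketch (not from the paper) contrasting a conventional normalized ASR target with a fully formatted target of the kind SPGISpeech provides, plus a toy word-error-rate helper showing why formatting differences dominate when hypotheses and references use different orthography. The example strings and the toy_wer function are hypothetical.

# Toy illustration (not from the paper): a conventional normalized ASR target
# versus a fully formatted target such as those in SPGISpeech.
normalized = "revenue grew four point five percent to two billion dollars in q two"
formatted = "Revenue grew 4.5% to $2 billion in Q2."

# In a cascaded pipeline, a separate inverse-text-normalization / punctuation model
# must map `normalized` to `formatted`; an end-to-end formatted model predicts the
# formatted string directly from audio and is scored against `formatted`.
def toy_wer(hyp: str, ref: str) -> float:
    """Word error rate via edit distance over whitespace tokens (illustrative only)."""
    h, r = hyp.split(), ref.split()
    d = [[0] * (len(r) + 1) for _ in range(len(h) + 1)]
    for i in range(len(h) + 1):
        d[i][0] = i
    for j in range(len(r) + 1):
        d[0][j] = j
    for i in range(1, len(h) + 1):
        for j in range(1, len(r) + 1):
            cost = 0 if h[i - 1] == r[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[len(h)][len(r)] / max(len(r), 1)

print(toy_wer(normalized, formatted))  # high, because nearly every token differs in formatting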

Citations

Scaling ASR Improves Zero and Few Shot Learning
TLDR
By training 1-10B parameter universal English ASR models, this work pushes the limits of speech recognition performance across many domains and proposes data selection techniques to efficiently scale training data to find the most valuable samples in massive datasets.
Earnings-22: A Practical Benchmark for Accents in the Wild
TLDR
Earnings-22 provides a free-to-use benchmark of real-world, accented audio to bridge academic and industrial research and examines Individual Word Error Rate (IWER), finding that key speech features impact model performance more for certain accents than others.
Lhotse: a speech data representation library for the modern deep learning ecosystem
TLDR
Cut and CutSet concepts are introduced, which simplify common data-wrangling tasks for audio and help incorporate the acoustic context of speech utterances; the paper also describes how Lhotse leverages PyTorch data API abstractions and adapts them to handle speech data for deep learning.
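
The sketch below illustrates how a Lhotse CutSet might be loaded and iterated; the manifest path is hypothetical and method names or signatures can differ between Lhotse versions, so treat this as an outline rather than exact API usage.

# Minimal sketch of Lhotse's Cut/CutSet abstraction (path and usage are illustrative;
# consult the Lhotse documentation for the exact API of your version).
from lhotse import load_manifest

cuts = load_manifest("cuts.jsonl.gz")   # a CutSet: a collection of Cut objects
for cut in cuts:
    audio = cut.load_audio()            # numpy array with the cut's samples
    text = cut.supervisions[0].text     # transcript attached to this segment
    print(cut.id, cut.duration, text)
    break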
Can Self-Supervised Learning solve the problem of child speech recognition?
TLDR
The pretrained wav2vec2 models were finetuned using different amounts of child speech training data to discover the optimum amount of data required to finetune the model for the task of child ASR.
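
As a rough illustration of the setup described above, the following sketch loads a pretrained wav2vec 2.0 CTC model from HuggingFace transformers and transcribes a single file; the checkpoint name and audio path are illustrative, and fine-tuning on child speech would continue CTC training from such a checkpoint on in-domain data.

# Sketch of wav2vec 2.0 CTC inference with HuggingFace transformers.
import torch
import soundfile as sf
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

speech, sr = sf.read("utterance.wav")          # expects 16 kHz mono audio
inputs = processor(speech, sampling_rate=sr, return_tensors="pt")
with torch.no_grad():
    logits = model(inputs.input_values).logits
pred_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(pred_ids)[0])     # greedy CTC transcription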
ASR in German: A Detailed Error Analysis
TLDR
This work presents a selection of ASR model architectures that are pretrained on the German language and evaluates them on a benchmark of diverse test datasets, identifying cross-architectural prediction errors, classifying those into categories and tracing the sources of errors back into training data as well as other sources.
Domain Adaptation of low-resource Target-Domain models using well-trained ASR Conformer Models
TLDR
Domain adaptation for low-resource Automatic Speech Recognition of target-domain data, when a well-trained ASR model trained on a large dataset is available, is investigated, and it is shown that applying Spectral Augmentation to the proposed features provides a further improvement in target-domain performance.
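
A minimal sketch of the spectral-augmentation idea mentioned above follows, assuming log-mel features arranged as a (time, frequency) matrix; mask counts and widths are illustrative, not the values used in the cited work.

# Minimal SpecAugment-style masking on a (time, freq) feature matrix.
import numpy as np

def spec_augment(features, num_freq_masks=2, freq_width=15,
                 num_time_masks=2, time_width=50, rng=None):
    rng = rng or np.random.default_rng()
    feats = features.copy()
    T, F = feats.shape
    for _ in range(num_freq_masks):             # mask random frequency bands
        w = rng.integers(0, freq_width + 1)
        f0 = rng.integers(0, max(F - w, 1))
        feats[:, f0:f0 + w] = 0.0
    for _ in range(num_time_masks):             # mask random time spans
        w = rng.integers(0, time_width + 1)
        t0 = rng.integers(0, max(T - w, 1))
        feats[t0:t0 + w, :] = 0.0
    return feats

augmented = spec_augment(np.random.randn(300, 80))  # e.g. 300 frames x 80 mel bins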
JTubeSpeech: corpus of Japanese speech collected from YouTube for speech recognition and speaker verification
TLDR
This paper builds a new Japanese speech corpus called “JTubeSpeech” from YouTube videos and subtitles for speech recognition and speaker verification, and consistently employs Connectionist Temporal Classification-based techniques for automatic speech recognition (ASR) and a speaker variation-based method for automatic speaker verification (ASV).
BPPF 2021 The 1st Workshop on Benchmarking: Past, Present and Future Proceedings of the Workshop
TLDR
This keynote traces the history of how benchmarking became important in Information Retrieval, then in speech (starting around 1975), and then in language (in 1988), and how this approach moved from speech to language.
Joint Prediction of Truecasing and Punctuation for Conversational Speech in Low-Resource Scenarios
TLDR
This work proposes a multi-task system that can exploit the relations between casing and punctuation to improve their prediction performance, and shows that by training the model in the written-text domain and then applying transfer learning to conversations, it can achieve reasonable performance with less data.
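
The following is a minimal sketch of a multi-task token tagger of the general kind described above: a shared encoder with separate casing and punctuation heads whose losses are summed during training. Layer types, sizes, and label sets are assumptions for illustration, not the paper's configuration.

# Minimal multi-task tagger for truecasing and punctuation: one shared encoder,
# two token-level classification heads, per-task losses summed during training.
import torch
import torch.nn as nn

class CasePunctTagger(nn.Module):
    def __init__(self, vocab_size, hidden=256, n_case=2, n_punct=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.encoder = nn.LSTM(hidden, hidden, batch_first=True, bidirectional=True)
        self.case_head = nn.Linear(2 * hidden, n_case)    # e.g. lower / capitalized
        self.punct_head = nn.Linear(2 * hidden, n_punct)  # e.g. none , . ?

    def forward(self, token_ids):
        h, _ = self.encoder(self.embed(token_ids))
        return self.case_head(h), self.punct_head(h)

model = CasePunctTagger(vocab_size=10000)
case_logits, punct_logits = model(torch.randint(0, 10000, (8, 32)))  # (batch, seq)
# Joint training: cross-entropy on both heads' token-level labels, losses summed.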
Scalable Data Annotation Pipeline for High-Quality Large Speech Datasets Development
TLDR
A human-in-the-loop (HITL) data annotation pipeline to generate high-quality, large-scale speech datasets is presented; the HITL pipeline is shown to improve annotation speed and capacity by at least 80%, with quality comparable to or higher than manual double-pass annotation.
...

References

SHOWING 1-10 OF 32 REFERENCES
GigaSpeech: An Evolving, Multi-domain ASR Corpus with 10,000 Hours of Transcribed Audio
This paper introduces GigaSpeech, an evolving, multi-domain English speech recognition corpus with 10,000 hours of high quality labeled audio suitable for supervised training, and 33,000 hours of…
NoiseQA: Challenge Set Evaluation for User-Centric Question Answering
TLDR
There is substantial room for progress before QA systems can be effectively deployed; the need for QA evaluation to expand to consider real-world use is highlighted, and the findings should spur greater community interest in the issues that arise when QA systems actually need to be of utility to humans.
Neural Inverse Text Normalization
TLDR
This work proposes an efficient and robust neural solution for ITN that leverages transformer-based seq2seq models and FST-based text normalization techniques for data preparation, and shows that it can be easily extended to other languages without the need for a linguistic expert to manually curate them.
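
As a toy illustration of what inverse text normalization (ITN) does, the dictionary-based sketch below maps a few spoken-form tokens to written form; the cited work instead learns this mapping with a Transformer seq2seq model trained on pairs prepared with FST-based normalization grammars, so the rule table here is purely illustrative.

# Toy spoken-form -> written-form mapping; real ITN must handle numbers, currency,
# dates, abbreviations, and context, which is why a learned seq2seq model is used.
SPOKEN_TO_WRITTEN = {
    "one": "1", "two": "2", "three": "3", "four": "4", "five": "5",
    "percent": "%", "dollars": "$",
}

def toy_itn(spoken: str) -> str:
    return " ".join(SPOKEN_TO_WRITTEN.get(tok, tok) for tok in spoken.split())

print(toy_itn("revenue grew five percent"))  # -> "revenue grew 5 %"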
Conformer: Convolution-augmented Transformer for Speech Recognition
TLDR
This work proposes the convolution-augmented transformer for speech recognition, named Conformer, which significantly outperforms previous Transformer- and CNN-based models, achieving state-of-the-art accuracies.
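
A compact PyTorch sketch of the Conformer block structure (half-step feed-forward, self-attention, convolution module, second half-step feed-forward, each with residual connections) is given below; dimensions, kernel size, and normalization details are illustrative, and relative positional encoding used in the original model is omitted.

# Sketch of a Conformer block; sizes are illustrative, not the paper's configuration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConformerBlock(nn.Module):
    def __init__(self, d_model=256, n_heads=4, conv_kernel=31, ff_mult=4):
        super().__init__()
        def ff():
            return nn.Sequential(nn.LayerNorm(d_model),
                                 nn.Linear(d_model, ff_mult * d_model), nn.SiLU(),
                                 nn.Linear(ff_mult * d_model, d_model))
        self.ff1, self.ff2 = ff(), ff()
        self.attn_norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.conv_norm = nn.LayerNorm(d_model)
        self.pointwise1 = nn.Conv1d(d_model, 2 * d_model, 1)   # expands channels for GLU
        self.depthwise = nn.Conv1d(d_model, d_model, conv_kernel,
                                   padding=conv_kernel // 2, groups=d_model)
        self.batch_norm = nn.BatchNorm1d(d_model)
        self.pointwise2 = nn.Conv1d(d_model, d_model, 1)
        self.final_norm = nn.LayerNorm(d_model)

    def forward(self, x):                       # x: (batch, time, d_model)
        x = x + 0.5 * self.ff1(x)               # first half-step feed-forward
        a = self.attn_norm(x)
        x = x + self.attn(a, a, a, need_weights=False)[0]
        c = self.conv_norm(x).transpose(1, 2)   # -> (batch, channels, time)
        c = F.glu(self.pointwise1(c), dim=1)    # gated linear unit
        c = self.pointwise2(F.silu(self.batch_norm(self.depthwise(c))))
        x = x + c.transpose(1, 2)
        x = x + 0.5 * self.ff2(x)               # second half-step feed-forward
        return self.final_norm(x)

y = ConformerBlock()(torch.randn(2, 100, 256))  # (batch, frames, d_model)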
Cross-Language Transfer Learning, Continuous Learning, and Domain Adaptation for End-to-End Automatic Speech Recognition
TLDR
This paper demonstrates the efficacy of transfer learning and continuous learning for various automatic speech recognition (ASR) tasks and shows that in all three cases, transfer learning from a good base model has higher accuracy than a model trained from scratch.
Common Voice: A Massively-Multilingual Speech Corpus
TLDR
This work presents speech recognition experiments using Mozilla’s DeepSpeech Speech-to-Text toolkit and finds an average Character Error Rate improvement for twelve target languages; for most of these languages, these are the first ever published results on end-to-end Automatic Speech Recognition.
Correction of Automatic Speech Recognition with Transformer Sequence-To-Sequence Model
In this work, we introduce a simple yet efficient post-processing model for automatic speech recognition. Our model has Transformer-based encoder-decoder architecture which "translates" acoustic…
NeMo: a toolkit for building AI applications using Neural Modules
TLDR
NeMo (Neural Modules) is a Python framework-agnostic toolkit for creating AI applications through re-usability, abstraction, and composition that provides built-in support for distributed training and mixed precision on latest NVIDIA GPUs.
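
A short sketch of a typical NeMo usage pattern follows; the checkpoint name and the transcribe() signature vary across NeMo releases, so treat this as an outline rather than exact API usage.

# Sketch: load a pretrained ASR model from NeMo and transcribe a file
# (checkpoint name and file path are illustrative).
import nemo.collections.asr as nemo_asr

asr_model = nemo_asr.models.EncDecCTCModel.from_pretrained(model_name="QuartzNet15x5Base-En")
print(asr_model.transcribe(["utterance.wav"]))  # list with one hypothesis string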
Fast and Accurate Capitalization and Punctuation for Automatic Speech Recognition Using Transformer and Chunk Merging
B. Nguyen, V. H. Nguyen, Luong Chi Mai. 2019 22nd Conference of the Oriental COCOSDA International Committee for the Co-ordination and Standardisation of Speech Databases and Assessment Techniques (O-COCOSDA), 2019.
TLDR
A method based on Transformer models and chunk merging is proposed to restore punctuation and capitalization in long-speech ASR transcriptions, and it outperforms existing methods in both accuracy and decoding speed.
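
The toy sketch below shows one way long token sequences can be split into overlapping chunks and the per-chunk predictions merged back; the overlap-handling rule here (a later chunk overwrites labels in the overlapped region) is a simplification, not necessarily the merging strategy used in the paper.

# Toy chunk splitting with overlap, plus merging of per-chunk label predictions.
def split_into_chunks(tokens, chunk_size=64, overlap=16):
    step = chunk_size - overlap
    return [(start, tokens[start:start + chunk_size])
            for start in range(0, max(len(tokens) - overlap, 1), step)]

def merge_chunk_labels(chunks_with_labels, total_len):
    merged = [None] * total_len
    for start, labels in chunks_with_labels:
        for i, lab in enumerate(labels):
            merged[start + i] = lab        # later chunks overwrite the overlap
    return merged

tokens = ["tok"] * 150
chunks = split_into_chunks(tokens)                      # [(0, ...), (48, ...), (96, ...)]
labels = merge_chunk_labels([(s, ["O"] * len(c)) for s, c in chunks], len(tokens))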
...