Earnings-21: A Practical Benchmark for ASR in the Wild

Miguel Del Rio, Natalie Delworth, Ryan Westerman, Michelle Huang, Nishchal Bhandari, Joseph Palakapilly, Quinten McNamara, Joshua Dong, Piotr Żelasko, Miguel Jette
Commonly used speech corpora inadequately challenge academic and commercial ASR systems. In particular, speech corpora lack the metadata needed for detailed analysis and WER measurement. In response, we present Earnings-21, a 39-hour corpus of earnings calls containing entity-dense speech from nine different financial sectors. This corpus is intended to benchmark ASR systems in the wild, with special attention to named entity recognition. We benchmark four commercial ASR models, two internal…
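As a rough illustration of the WER measurement the abstract refers to: word error rate is the word-level Levenshtein distance between the reference and hypothesis transcripts, normalized by reference length. A minimal sketch (the `wer` function name and interface are illustrative, not from the paper):

```python
# Minimal word error rate (WER) sketch: word-level edit distance
# (substitutions + insertions + deletions) divided by reference length.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # one deletion over 6 reference words
```

Production scoring toolkits add normalization (casing, punctuation, spelling variants) on top of this alignment, which is exactly the metadata gap the paper highlights.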


Improving Contextual Recognition of Rare Words with an Alternate Spelling Prediction Model

An alternate spelling prediction model is proposed that improves recall of rare words by 34.7% relative and of out-of-vocabulary words by 97.2% relative, compared to contextual biasing without alternate spellings.

Toward Zero Oracle Word Error Rate on the Switchboard Benchmark

This work demonstrates major improvements in word error rate (WER) by correcting the reference transcriptions and deviating from the official scoring methodology, and explores using standardized scoring tools to compute oracle WER by selecting the best among a list of alternatives.

ArmSpeech: Armenian Spoken Language Corpus

The Armenian language is an independent branch of the Indo-European language family and the official language of the Republic of Armenia and the Republic of Artsakh. According to various reliable…

Residual Language Model for End-to-end Speech Recognition

A simple external LM fusion method for domain adaptation is proposed, which accounts for the internal LM estimate during training and directly models the residual factor between the external and internal LMs, namely the residual LM.

Earnings-22: A Practical Benchmark for Accents in the Wild

Earnings-22 provides a free-to-use benchmark of real-world, accented audio to bridge academic and industrial research and examines Individual Word Error Rate (IWER), finding that key speech features impact model performance more for certain accents than others.

ESPnet-SLU: Advancing Spoken Language Understanding through ESPnet

This work enhances the toolkit to provide implementations for various SLU benchmarks that enable researchers to seamlessly mix and match different ASR and NLU models, and provides pretrained models with intensively tuned hyper-parameters that can match or even outperform the current state-of-the-art performance.

VarArray: Array-Geometry-Agnostic Continuous Speech Separation

VarArray is proposed: an array-geometry-agnostic speech separation neural network model that adapts elements previously proposed separately, including transform-average-concatenate, conformer speech separation, and inter-channel phase differences, and combines them in an efficient and cohesive way.

Benchmarking ASR Systems Based on Post-Editing Effort and Error Analysis

This paper offers a comparative evaluation of four commercial ASR systems, which are evaluated according to the post-editing effort required to reach “publishable” quality and according to the number…

ASR4REAL: An extended benchmark for speech models

It is found that even though recent models do not seem to exhibit a gender bias, they usually show substantial performance discrepancies by accent, and even larger ones depending on the socio-economic status of the speakers.

The People's Speech: A Large-Scale Diverse English Speech Recognition Dataset for Commercial Usage

The legal and ethical issues surrounding the creation of a sizable machine learning corpus, and plans for continued maintenance of the project under MLCommons’s sponsorship, are discussed.

WER we are and WER we think we are

It is shown that WERs are significantly higher than the best reported results, and a set of guidelines are formulated which may aid in the creation of real-life, multi-domain datasets with high quality annotations for training and testing of robust ASR systems.

Librispeech: An ASR corpus based on public domain audio books

It is shown that acoustic models trained on LibriSpeech give lower error rates on the Wall Street Journal (WSJ) test sets than models trained on WSJ itself.

Preliminaries to a Theory of Speech Disfluencies

Examination of disfluencies in the spontaneous speech of adult normal speakers of American English shows regularities in a variety of dimensions that can help guide and constrain models of spoken language production.

SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition

This work presents SpecAugment, a simple data augmentation method for speech recognition that is applied directly to the feature inputs of a neural network (i.e., filter bank coefficients) and achieves state-of-the-art performance on the LibriSpeech 960h and Switchboard 300h tasks, outperforming all prior work.
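The masking SpecAugment applies to filter bank features can be sketched as follows. The hypothetical `spec_augment` function below zeroes random frequency bands and time spans of a log-mel feature matrix; the parameter names F and T loosely follow the paper's mask-width limits, but the exact masking policy here is an illustrative assumption, not the paper's reference implementation:

```python
import random

# Sketch of SpecAugment-style time and frequency masking, applied directly
# to a log-mel feature matrix shaped (frames x mel bins).
def spec_augment(features, num_freq_masks=2, F=27, num_time_masks=2, T=100):
    n_frames = len(features)
    n_mels = len(features[0])
    out = [row[:] for row in features]  # copy so the input stays untouched
    # frequency masking: zero a band of up to F consecutive mel bins
    for _ in range(num_freq_masks):
        f = random.randint(0, min(F, n_mels))
        f0 = random.randint(0, n_mels - f)
        for row in out:
            for k in range(f0, f0 + f):
                row[k] = 0.0
    # time masking: zero a span of up to T consecutive frames
    for _ in range(num_time_masks):
        t = random.randint(0, min(T, n_frames))
        t0 = random.randint(0, n_frames - t)
        for i in range(t0, t0 + t):
            out[i] = [0.0] * n_mels
    return out
```

The paper also includes a time-warping step, omitted here for brevity; masking is the component the paper finds most impactful.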

The Rich Transcription 2007 Meeting Recognition Evaluation

We present the design and results of the Spring 2007 (RT-07) Rich Transcription Meeting Recognition Evaluation, the fifth in a series of community-wide evaluations of language technologies in the…

The Fisher Corpus: a Resource for the Next Generations of Speech-to-Text

The design and implementation of the Fisher protocol for collecting conversational telephone speech which has yielded more than 16,000 English conversations is described and the Quick Transcription specification that allowed 2000 hours of Fisher audio to be transcribed in less than one year is discussed.

Pushing the Limits of Semi-Supervised Learning for Automatic Speech Recognition

We employ a combination of recent developments in semi-supervised learning for automatic speech recognition to obtain state-of-the-art results on LibriSpeech utilizing the unlabeled audio of the…

Hybrid CTC/Attention Architecture for End-to-End Speech Recognition

The proposed hybrid CTC/attention end-to-end ASR is applied to two large-scale ASR benchmarks, and exhibits performance that is comparable to conventional DNN/HMM ASR systems based on the advantages of both multiobjective learning and joint decoding without linguistic resources.

Densely Connected Networks for Conversational Speech Recognition

It is shown that the proposed dense LSTMs provide more reliable performance than conventional residual LSTMs as more LSTM layers are stacked in neural networks.

Neural Machine Translation of Rare Words with Subword Units

This paper introduces a simpler and more effective approach, making the NMT model capable of open-vocabulary translation by encoding rare and unknown words as sequences of subword units, and empirically shows that subword models improve over a back-off dictionary baseline for the WMT 15 translation tasks English-German and English-Russian by 1.3 BLEU.
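The subword-unit idea can be sketched as the classic byte-pair-encoding merge loop: starting from characters, repeatedly merge the most frequent adjacent symbol pair in the training vocabulary. The `learn_bpe` helper below is illustrative, not the paper's reference implementation:

```python
from collections import Counter

# Minimal byte-pair-encoding (BPE) sketch: learn merge operations from a
# word-frequency dictionary by greedily merging the most frequent pair.
def learn_bpe(word_counts, num_merges):
    # represent each word as a tuple of symbols plus an end-of-word marker
    vocab = {tuple(w) + ("</w>",): c for w, c in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        # count every adjacent symbol pair, weighted by word frequency
        pairs = Counter()
        for word, count in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += count
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # rewrite the vocabulary with the chosen pair fused into one symbol
        merged = {}
        for word, count in vocab.items():
            new_word, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    new_word.append(word[i] + word[i + 1])
                    i += 2
                else:
                    new_word.append(word[i])
                    i += 1
            merged[tuple(new_word)] = count
        vocab = merged
    return merges

print(learn_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, 3))
```

On this toy vocabulary the frequent "est" suffix is assembled within the first few merges, which is how rare and unknown words end up representable as sequences of learned subword units.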