WER we are and WER we think we are

Piotr Szymański, Piotr Żelasko, Mikolaj Morzy, Adrian Szymczak, Marzena Zyla-Hoppe, Joanna Banaszczak, Lukasz Augustyniak, Jan Mizgajski, Yishay Carmiel

Natural language processing of conversational speech requires the availability of high-quality transcripts. In this paper, we express our skepticism towards the recent reports of very low Word Error Rates (WERs) achieved by modern Automatic Speech Recognition (ASR) systems on benchmark datasets. We outline several problems with popular benchmarks and compare three state-of-the-art commercial ASR systems on an internal dataset of real-life spontaneous human conversations and HUB’05 public… 

How Might We Create Better Benchmarks for Speech Recognition?

A versatile framework designed to describe interactions between linguistic variation and ASR performance metrics is introduced, and a taxonomy of speech recognition use cases is outlined, proposed for the next generation of ASR benchmarks.

Using Kaldi for Automatic Speech Recognition of Conversational Austrian German

A Kaldi-based ASR system is improved by incorporating a (large) knowledge-based pronunciation lexicon, while different data-based methods are explored to restrict the number of pronunciation variants for each lexical entry. The results indicate that for low-resource scenarios, despite the general trend in speech technology towards using data-based methods only, knowledge-based approaches are a successful, efficient method.

Rethinking Evaluation in ASR: Are Our Models Robust Enough?

It is demonstrated that when a large enough set of benchmarks is used, average word error rate (WER) performance over them provides a good proxy for performance on real-world data.

Earnings-21: A Practical Benchmark for ASR in the Wild

It is found that ASR accuracy for certain NER categories is poor, presenting a significant impediment to transcript comprehension and usage.

Evaluating Automatic Speech Recognition Quality and Its Impact on Counselor Utterance Coding

This work analyzes the quality of ASR in the psychotherapy domain, using motivational interviewing conversations between therapists and clients, and empirically studies the effect of mixing ASR and manual data during the training of a downstream NLP model.

SpeechBrain: A General-Purpose Speech Toolkit

The core architecture of SpeechBrain is described, designed to support several tasks of common interest, allowing users to naturally conceive, compare and share novel speech processing pipelines.

A Semi-Automated Live Interlingual Communication Workflow Featuring Intralingual Respeaking: Evaluation and Benchmarking

It is found that the semi-automated workflow combining intralingual respeaking and machine translation is capable of generating outputs that are similar in terms of accuracy and completeness to the outputs produced in the benchmarking workflow, although the small scale of the experiment requires caution in interpreting this result.

Benchmarking ASR Systems Based on Post-Editing Effort and Error Analysis

This paper offers a comparative evaluation of four commercial ASR systems which are evaluated according to the post-editing effort required to reach “publishable” quality and according to the number

Vox Populi, Vox DIY: Benchmark Dataset for Crowdsourced Audio Transcription

A principled pipeline for constructing datasets of crowdsourced audio transcriptions in any novel domain is presented, and its applicability to an under-resourced language is shown by constructing VOXDIY, a counterpart of CROWDSPEECH for the Russian language.

(Commercial) Automatic Speech Recognition as a Tool in Sociolinguistic Research

As speech datasets used in sociolinguistic research increase in size, laborious and time-intensive manual orthographic transcription is a challenge, limiting the amount of (transcribed) data which



Librispeech: An ASR corpus based on public domain audio books

It is shown that acoustic models trained on LibriSpeech give a lower error rate on the Wall Street Journal (WSJ) test sets than models trained on WSJ itself.

ODSQA: Open-Domain Spoken Question Answering Dataset

This paper releases the Open-Domain Spoken Question Answering Dataset (ODSQA), the largest real SQA dataset, and finds that ASR errors have a catastrophic impact on SQA and that data augmentation on text-based QA training examples can improve SQA.

The Fisher Corpus: a Resource for the Next Generations of Speech-to-Text

The design and implementation of the Fisher protocol for collecting conversational telephone speech, which has yielded more than 16,000 English conversations, is described, and the Quick Transcription specification that allowed 2000 hours of Fisher audio to be transcribed in less than one year is discussed.

TED-LIUM: an Automatic Speech Recognition dedicated corpus

The content of the corpus is described, along with how the data was collected and processed, how it will be made publicly available, and how an ASR system built using this data reached a WER of 17.4%.

Observations on overlap: findings and implications for automatic processing of multi-party conversation

It is suggested that overlap is an important inherent characteristic of conversational speech that should not be ignored; on the contrary, it should be jointly modeled with acoustic and language model information in machine processing of conversation.

How do we speak with ALEXA

Insight into the participants' addressee behavior is presented, and it is shown that users could recognize changes in some of their speech characteristics between human-human and human-computer conversations.

Spoken SQuAD: A Study of Mitigating the Impact of Speech Recognition Errors on Listening Comprehension

On the new listening comprehension task, it is found that speech recognition errors have a catastrophic impact on machine comprehension, and several approaches are proposed to mitigate the impact.

SWITCHBOARD: telephone speech corpus for research and development

  • J. Godfrey, E. Holliman, J. McDaniel
  • Physics, Linguistics
    [Proceedings] ICASSP-92: 1992 IEEE International Conference on Acoustics, Speech, and Signal Processing
  • 1992
SWITCHBOARD is a large multispeaker corpus of conversational speech and text which should be of interest to researchers in speaker authentication and large vocabulary speech recognition. About 2500

Enriching ASR Lattices with POS Tags for Dependency Parsing

This paper manipulates the ASR process from the pronouncing dictionary onward to use word-POS pairs instead of words to enrich ASR word lattices, demonstrating a successful lattice-based integration of ASR and POS tagging.

Acoustic Modeling for Overlapping Speech Recognition: JHU CHiME-5 Challenge System

This paper summarizes the acoustic modeling efforts in the Johns Hopkins University speech recognition system for the CHiME-5 challenge to recognize highly-overlapped dinner party speech recorded by multiple microphone arrays, and achieves a word error rate improvement.