WeNet 2.0: More Productive End-to-End Speech Recognition Toolkit

  title={WeNet 2.0: More Productive End-to-End Speech Recognition Toolkit},
  author={Binbin Zhang and Di Wu and Zhendong Peng and Xingcheng Song and Zhuoyuan Yao and Hang Lv and Lei Xie and Chao Yang and Fuping Pan and Jianwei Niu},
Recently, we made available WeNet [1], a production-oriented end-to-end speech recognition toolkit, which introduces a unified two-pass (U2) framework and a built-in runtime to address the streaming and non-streaming decoding modes in a single model. To further improve ASR performance and facilitate various production requirements, in this paper, we present WeNet 2.0 with four important updates. (1) We propose U2++, a unified two-pass framework with bidirectional attention decoders, which… 
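The single-model streaming/non-streaming behavior of the U2/U2++ framework rests on chunk-based causal attention: during decoding, each frame may attend to every frame in its own chunk and in all earlier chunks, and an unlimited chunk recovers full-context (non-streaming) attention. A minimal sketch of such a mask, assuming a simple list-of-lists representation (the function name and signature are illustrative, not WeNet's actual API):

```python
# Sketch of a chunk-based attention mask: mask[i][j] is True when
# frame i may attend to frame j. chunk_size <= 0 means full context
# (the non-streaming case); a positive chunk_size yields streaming
# attention limited to the current and all previous chunks.

def chunk_attention_mask(seq_len: int, chunk_size: int) -> list[list[bool]]:
    mask = []
    for i in range(seq_len):
        if chunk_size <= 0:
            limit = seq_len                             # full attention
        else:
            limit = (i // chunk_size + 1) * chunk_size  # end of frame i's chunk
        mask.append([j < limit for j in range(seq_len)])
    return mask

# With chunk_size=2 and 4 frames: frames 0-1 see frames 0-1,
# frames 2-3 see frames 0-3.
```

Training with randomly varied chunk sizes is what lets one set of encoder weights serve both decoding modes at inference time.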

Fast-U2++: Fast and Accurate End-to-End Speech Recognition in Joint CTC/Attention Frames

Fast-U2++, an enhanced version of U2++ that further reduces partial latency, is presented: it emits partial results from the bottom layers of its encoder using a small chunk, while the top layers of the encoder use a large chunk to compensate for the performance degradation caused by the small chunk.

Audio-to-Intent Using Acoustic-Textual Subword Representations from End-to-End ASR

The proposed approach learns robust representations for audio-to-intent classification and correctly mitigates 93.

Wespeaker: A Research and Production oriented Speaker Embedding Learning Toolkit

Wespeaker provides scalable data management, state-of-the-art speaker embedding models, loss functions, and scoring back-ends, with highly competitive results achieved by structured recipes that were adopted in the winning systems of several speaker verification challenges.

Towards A Unified Conformer Structure: from ASR to ASV Task

The Conformer architecture is adapted from ASR to ASV with very minor changes, and the Length-Scaled Attention (LSA) method and Sharpness-Aware Minimization (SAM) are adopted to improve model generalization.

The NPU-ASLP System for The ISCSLP 2022 Magichub Code-Switching ASR Challenge

This paper describes the NPU-ASLP system submitted to the ISCSLP 2022 Magichub Code-Switching ASR Challenge, which achieves a 16.87% mix error rate (MER) on the test set and ranks 2nd in the challenge.

TrimTail: Low-Latency Streaming ASR with Simple but Effective Spectrogram-Level Length Penalty

It is demonstrated that TrimTail is computationally cheap, can be applied online, and can be optimized with any training loss and any model architecture on any dataset without extra effort, by applying it to various end-to-end streaming ASR networks trained with either CTC loss or Transducer loss.
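As summarized above, TrimTail's core idea is a spectrogram-level length penalty: randomly trimming trailing frames of the input features during training. A minimal sketch under that reading (the function name and the trim range are illustrative choices, not the paper's actual hyperparameters):

```python
import random

# Sketch of a spectrogram-level tail trim applied as a training-time
# augmentation: drop a random number of trailing frames so the model
# learns not to delay its outputs toward the end of the utterance.

def trim_tail(spectrogram, max_trim=10):
    """spectrogram: list of frames, each frame a list of filterbank values."""
    n = random.randint(0, min(max_trim, len(spectrogram) - 1))
    return spectrogram[:len(spectrogram) - n] if n else spectrogram

frames = [[0.0] * 80 for _ in range(100)]   # 100 frames of 80-dim features
out = trim_tail(frames, max_trim=10)        # keeps between 90 and 100 frames
```

Because the operation touches only the input features, it composes with any loss or architecture, which is what makes the penalty "simple but effective".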

Sequentially Sampled Chunk Conformer for Streaming End-to-End ASR

This paper demonstrates the performance gains from using sequentially sampled chunk-wise multi-head self-attention (SSC-MHSA) in the Conformer encoder, which allows efficient cross-chunk interactions while keeping linear complexity, and explores chunked convolution to make use of the chunked future context.

LeVoice ASR Systems for the ISCSLP 2022 Intelligent Cockpit Speech Recognition Challenge

This paper compares and fuses the hybrid architecture and two kinds of end-to-end architectures of the LeVoice automatic speech recognition systems for Track 2 of the 2022 Intelligent Cockpit Speech Recognition Challenge, a speech recognition task without limits on model size.

MnASR: A Free Speech Corpus For Mongolian Speech Recognition And Accompanied Baselines

  • Yihao Wu, Yonghe Wang, Hui Zhang, F. Bao, Guanglai Gao
  • Computer Science
    2022 25th Conference of the Oriental COCOSDA International Committee for the Co-ordination and Standardisation of Speech Databases and Assessment Techniques (O-COCOSDA)
  • 2022
The MnASR database is released which contains 345 hours of Mongolian speech signal and the corresponding transcription and speech recognition baselines are made public at the same time.



WeNet: Production Oriented Streaming and Non-Streaming End-to-End Speech Recognition Toolkit

An open source speech recognition toolkit called WeNet is proposed, in which a new two-pass approach named U2 is implemented to unify streaming and non-streaming end-to-end (E2E) speech recognition in a single model.
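In U2's two-pass scheme, the streaming CTC pass produces n-best hypotheses, and the attention decoder rescores them; the final ranking uses a weighted combination of the two scores. A hypothetical sketch of that fusion step (the function name, tuple layout, and default weight are illustrative, not WeNet's actual API):

```python
# Sketch of attention rescoring: pick the n-best hypothesis with the
# highest weighted sum of its first-pass CTC score and second-pass
# attention-decoder score (both log-domain, higher is better).

def rescore(nbest, ctc_weight=0.5):
    """nbest: list of (hypothesis, ctc_score, attention_score) tuples."""
    def fused(item):
        _, ctc, att = item
        return ctc_weight * ctc + (1.0 - ctc_weight) * att
    return max(nbest, key=fused)[0]

nbest = [("hello world", -3.2, -2.1),
         ("hello word",  -3.0, -4.5)]
best = rescore(nbest)  # "hello world": fused -2.65 beats -3.75
```

The second pass only rescores a short n-best list rather than running full autoregressive beam search, which is why it adds little latency on top of the streaming pass.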

Contextual RNN-T For Open Domain ASR

Modifications to the RNN-T model are proposed that allow the model to utilize additional metadata text with the objective of improving performance on Named Entities (WER-NE) for videos with related metadata.

A Streaming On-Device End-To-End Model Surpassing Server-Side Conventional Model Quality and Latency

A first-pass Recurrent Neural Network Transducer (RNN-T) model and a second-pass Listen, Attend, Spell (LAS) rescorer are developed that surpass a conventional model in both quality and latency, and RNN-T+LAS is found to offer a better WER and latency tradeoff than the conventional model.

Exploring RNN-Transducer for Chinese speech recognition

  • Senmao Wang, Pan Zhou, Wei Chen, Jia Jia, Lei Xie
  • Computer Science
    2019 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)
  • 2019
This paper explores RNN-T for a Chinese large vocabulary continuous speech recognition (LVCSR) task, proposes a new learning-rate decay strategy, and finds that adding convolutional layers at the beginning of the network and using ordered data allows discarding the encoder pre-training process without loss of performance.

Deep Speech 2: End-to-End Speech Recognition in English and Mandarin

It is shown that an end-to-end deep learning approach can be used to recognize either English or Mandarin Chinese speech, two vastly different languages, and is competitive with the transcription of human workers when benchmarked on standard datasets.

Attention-Based Models for Speech Recognition

The attention mechanism is extended with features needed for speech recognition, and a novel, generic method of adding location-awareness to the attention mechanism is proposed to alleviate the issue of high phoneme error rate.
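The location-awareness mentioned above works by feeding convolutional features of the previous attention weights into the scoring function, so the model knows where it attended at the last step. A toy sketch with NumPy (all dimensions and weight matrices below are arbitrary illustrative values, not the paper's configuration):

```python
import numpy as np

# Toy location-aware attention: the score for each encoder frame depends
# on the decoder state, the frame itself, and location features obtained
# by convolving the previous attention weights.

rng = np.random.default_rng(0)
T, enc_dim, dec_dim, att_dim, K = 6, 4, 3, 5, 3  # frames, dims, conv width

W = rng.normal(size=(att_dim, dec_dim))  # projects decoder state
V = rng.normal(size=(att_dim, enc_dim))  # projects encoder frames
U = rng.normal(size=(att_dim,))          # projects location features
w = rng.normal(size=(att_dim,))          # scoring vector
conv = rng.normal(size=(K,))             # 1-D kernel over previous weights

h = rng.normal(size=(T, enc_dim))        # encoder outputs
s = rng.normal(size=(dec_dim,))          # current decoder state
prev_align = np.full(T, 1.0 / T)         # previous attention weights

f = np.convolve(prev_align, conv, mode="same")  # location features, shape (T,)
scores = np.array([
    w @ np.tanh(W @ s + V @ h[t] + U * f[t]) for t in range(T)
])
align = np.exp(scores - scores.max())
align /= align.sum()                     # new attention weights, sum to 1
```

The convolution over `prev_align` is the only difference from plain additive attention; it biases the model toward moving monotonically through the utterance, which is what reduces phoneme errors on long inputs.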

Streaming End-to-end Speech Recognition for Mobile Devices

This work describes efforts at building an E2E speech recognizer using a recurrent neural network transducer and finds that the proposed approach can outperform a conventional CTC-based model in terms of both latency and accuracy.

Cascade RNN-Transducer: Syllable Based Streaming On-Device Mandarin Speech Recognition with a Syllable-To-Character Converter

The cascade RNN-T approach surpasses the character-based RNN-T by a large margin on several Mandarin test sets, with much higher recognition quality and similar latency, and can easily be used to strengthen the language model.

Deliberation Model Based Two-Pass End-To-End Speech Recognition

This work proposes to attend to both acoustics and first-pass hypotheses using a deliberation network and achieves 12% relative WER reduction compared to LAS rescoring in Google Voice Search (VS) tasks, and 23% reduction on a proper noun test set.

WenetSpeech: A 10000+ Hours Multi-domain Mandarin Corpus for Speech Recognition

WenetSpeech is the current largest open-source Mandarin speech corpus with transcriptions, which benefits research on production-level speech recognition, and a novel end-to-end label error detection approach is proposed.