Speech Recognition by Simply Fine-Tuning BERT

  • Wen-Chin Huang, Chia-Hua Wu, Shang-Bao Luo, Kuan-Yu Chen, Hsin-Min Wang, Tomoki Toda
  • Published 30 January 2021
  • Computer Science, Engineering
  • ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
We propose a simple method for automatic speech recognition (ASR) by fine-tuning BERT, a language model (LM) that is trained on large-scale unlabeled text data and can generate rich contextual representations. Our assumption is that, given a history context sequence, a powerful LM can narrow the range of possible choices, and the speech signal can be used as a simple clue. Hence, compared with conventional ASR systems that train a powerful acoustic model (AM) from scratch, we believe that speech…
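
The assumption stated in the abstract (the LM narrows the candidates, and the speech signal serves as a clue that picks among them) can be illustrated with a toy sketch. This is not the paper's model: the function names, the shortlist size top_k, the additive log-score combination, and all numeric scores below are hypothetical choices made only for illustration.

```python
import math


def log_softmax(logits):
    """Turn a {token: logit} dict into {token: log-probability}."""
    peak = max(logits.values())
    log_norm = peak + math.log(sum(math.exp(v - peak) for v in logits.values()))
    return {tok: v - log_norm for tok, v in logits.items()}


def pick_next_token(lm_logits, acoustic_scores, top_k=2):
    """Hypothetical helper: LM shortlists tokens, acoustic clue decides.

    lm_logits:       {token: score} from a language model given the history.
    acoustic_scores: {token: log-score} of how well each token matches the
                     current speech segment (made-up values in this sketch).
    """
    lm_logp = log_softmax(lm_logits)
    # Step 1: the LM narrows the range of possible choices.
    shortlist = sorted(lm_logp, key=lm_logp.get, reverse=True)[:top_k]
    # Step 2: the speech signal is used as a clue within that shortlist.
    combined = {tok: lm_logp[tok] + acoustic_scores[tok] for tok in shortlist}
    return max(combined, key=combined.get)


# History: "I would like a cup of ..." (all scores are invented).
lm_logits = {"coffee": 2.2, "tea": 2.0, "key": 0.1, "sea": -1.0}
# The audio sounds like "tea"; "key" and "sea" are acoustically confusable.
acoustic_scores = {"coffee": -2.0, "tea": 1.5, "key": 1.2, "sea": 1.4}

print(max(lm_logits, key=lm_logits.get))            # LM alone: coffee
print(pick_next_token(lm_logits, acoustic_scores))  # LM + clue: tea
```

In this sketch the LM alone prefers "coffee", and the acoustically confusable words "key" and "sea" never enter the shortlist, so a weak acoustic clue is enough to settle on "tea".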

On the differences between BERT and MT encoder spaces and how to address them in translation tasks
This work sheds light on the embedding spaces that BERT and NMT encoders create, comparing them with average cosine similarity, contextuality metrics, and measures of representational similarity, and finds that BERT and NMT encoder representations differ significantly from one another.
Non-autoregressive Transformer-based End-to-end ASR using BERT
A non-autoregressive transformer-based end-to-end ASR model based on BERT is presented and a series of experiments conducted on the AISHELL-1 dataset demonstrates competitive or superior results of the proposed model when compared to state-of-the-art ASR systems.
Effective Sentence Scoring Method Using BERT for Speech Recognition
An effective sentence scoring method is proposed by adapting BERT to the n-best list rescoring task, which has no fine-tuning step, and it is shown empirically that the left and right representations should be fused in biLMs for scoring a sentence.
Listen, attend and spell: A neural network for large vocabulary conversational speech recognition
We present Listen, Attend and Spell (LAS), a neural speech recognizer that transcribes speech utterances directly to characters without pronunciation models, HMMs or other components of traditional
The Kaldi Speech Recognition Toolkit
The design of Kaldi, a free, open-source toolkit for speech recognition research, is described. Kaldi provides a speech recognition system based on finite-state automata, together with detailed documentation and a comprehensive set of scripts for building complete recognition systems.
Incorporating BERT into Neural Machine Translation
A new algorithm named BERT-fused model is proposed, in which BERT is first used to extract representations for an input sequence, and then the representations are fused with each layer of the encoder and decoder of the NMT model through attention mechanisms.
CIF: Continuous Integrate-And-Fire for End-To-End Speech Recognition
  • Linhao Dong, Bo Xu
  • Computer Science, Engineering
  • ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2020
A novel soft and monotonic alignment mechanism for sequence transduction is proposed. It is inspired by the integrate-and-fire model in spiking neural networks, is employed in the encoder-decoder framework, and consists of continuous functions, hence the name Continuous Integrate-and-Fire (CIF).
Distilling the Knowledge of BERT for Sequence-to-Sequence ASR
This work leverages both left and right context by applying BERT as an external language model to seq2seq ASR through knowledge distillation. The approach outperforms other LM application approaches such as n-best rescoring and shallow fusion, and it requires no extra inference cost.
Improving Language Understanding by Generative Pre-Training
The general task-agnostic model outperforms discriminatively trained models that use architectures specifically crafted for each task, significantly improving upon the state of the art in 9 out of the 12 tasks studied.
Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks
This paper presents a novel method for training RNNs to label unsegmented sequences directly, thereby solving both problems of sequence learning and post-processing.
AISHELL-1: An open-source Mandarin speech corpus and a speech recognition baseline
  • Hui Bu, Jiayu Du, X. Na, Bengu Wu, Hao Zheng
  • Computer Science
  • 2017 20th Conference of the Oriental Chapter of the International Coordinating Committee on Speech Databases and Speech I/O Systems and Assessment (O-COCOSDA)
  • 2017
An open-source Mandarin speech corpus called AISHELL-1 is released. It is by far the largest corpus suitable for conducting speech recognition research and building speech recognition
Triggered Attention for End-to-end Speech Recognition
The proposed triggered attention (TA) decoder concept achieves similar or better ASR results in all experiments compared to the full-sequence attention model, while also limiting the decoding delay to two look-ahead frames, which in this setup corresponds to an output delay of 80 ms.