Efficient Neural Architecture Search for End-to-End Speech Recognition Via Straight-Through Gradients

Huahuan Zheng, Keyu An, Zhijian Ou
2021 IEEE Spoken Language Technology Workshop (SLT)
Neural Architecture Search (NAS), the process of automating architecture engineering, is an appealing next step to advancing end-to-end Automatic Speech Recognition (ASR), replacing expert-designed networks with learned, task-specific architectures. In contrast to early, computationally demanding NAS methods, recent gradient-based NAS methods, e.g., DARTS (Differentiable ARchiTecture Search), SNAS (Stochastic NAS) and ProxylessNAS, significantly improve NAS efficiency. In this paper, we make…
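The straight-through trick named in the title can be illustrated with a small sketch. This is hypothetical illustrative code, not the authors' implementation: the forward pass commits to a single discrete candidate operation by argmax, while the backward pass substitutes the gradient of the softmax-weighted mixture as a surrogate.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def straight_through_select(alpha, candidate_outputs):
    """Forward: keep only the argmax candidate (a discrete architecture choice).
    Backward: approximate the gradient w.r.t. the architecture weights alpha
    by the gradient of the soft (softmax-weighted) mixture instead."""
    p = softmax(alpha)
    hard = np.zeros_like(p)
    hard[np.argmax(p)] = 1.0
    out = hard @ candidate_outputs  # discrete forward pass
    # Gradient of the soft mixture p @ y w.r.t. alpha_j is p_j * (y_j - p @ y)
    grad_alpha = p * (candidate_outputs - p @ candidate_outputs)
    return out, grad_alpha

# Hypothetical example: three candidate ops producing scalar outputs
alpha = np.array([0.5, 2.0, -1.0])  # architecture weights
y = np.array([1.0, 3.0, -2.0])      # per-candidate outputs
out, grad = straight_through_select(alpha, y)  # out == y[1] == 3.0
```

Because the hard selection is non-differentiable, the softmax Jacobian serves as the surrogate gradient; this conveys the general straight-through idea rather than the exact estimator used in the paper.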

Figures and Tables from this paper

EfficientTDNN: Efficient Architecture Search for Speaker Recognition
Comprehensive investigations suggest that the trained supernet generalizes to subnets not sampled during training and achieves a favorable trade-off between accuracy and efficiency.
Exploiting Single-Channel Speech for Multi-Channel End-to-End Speech Recognition: A Comparative Study
This paper systematically compares the performance of three schemes to exploit external single-channel data for multi-channel end-to-end ASR, namely back-end pre-training, data scheduling, and data simulation, under different settings such as the size of the single-channel data and the choice of the front-end.
Deformable TDNN with adaptive receptive fields for speech recognition
A latency-control mechanism for deformable TDNNs is proposed, which enables deformable TDNNs to perform streaming ASR without accuracy degradation; deformable TDNNs obtain state-of-the-art results on WSJ benchmarks.
Advancing CTC-CRF Based End-to-End Speech Recognition with Wordpieces and Conformers
This paper investigates techniques to enable the recently developed wordpiece modeling units and Conformer neural networks to be successfully applied in CTC-CRFs, and suggests that Conformer can improve recognition performance significantly.
Neural Architecture Search for Speech Emotion Recognition
To accelerate the candidate architecture optimization, a uniform path dropout strategy is proposed to encourage all candidate architecture operations to be equally optimized to improve SER performance.
Neural Architecture Search for LF-MMI Trained Time Delay Neural Networks
  • Shou-Yong Hu, Xurong Xie, H. Meng
  • Computer Science
    ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2021
A range of neural architecture search techniques are used to automatically learn two types of hyper-parameters of state-of-the-art factored time delay neural networks (TDNNs): i) the left and right splicing context offsets; and ii) the dimensionality of the bottleneck linear projection at each hidden layer.
Efficient Gradient-Based Neural Architecture Search For End-to-End ASR
This work focuses on applying NAS to the most popular manually designed model, Conformer, and proposes an efficient ASR model searching method that benefits from the natural advantage of differentiable architecture search (DARTS) in reducing computational overheads.
Improved Conformer-based End-to-End Speech Recognition Using Neural Architecture Search
This work proposes a NAS-based ASR framework containing one search space and one differentiable search algorithm called Differentiable Architecture Search (DARTS); the search space follows the convolution-augmented transformer (Conformer) backbone, a more expressive ASR architecture than those used in existing NAS-based ASR frameworks.
Multilingual and Crosslingual Speech Recognition Using Phonological-Vector Based Phone Embeddings
This paper proposes to join phonology-driven phone embedding (top-down) and deep neural network (DNN) based acoustic feature extraction (bottom-up) to calculate phone probabilities, and introduces a new method called JoinAP (Joining of Acoustics and Phonology), where no inversion from acoustics to phonological features is required for speech recognition.


Improving End-to-End Speech Recognition with Policy Learning
It is shown that joint training improves relative performance by 4% to 13% for the end-to-end model compared to the same model learned through maximum likelihood, and that policy learning makes it possible to directly optimize the (otherwise non-differentiable) performance metric.
Espresso: A Fast End-to-End Neural Speech Recognition Toolkit
Espresso achieves state-of-the-art ASR performance on the WSJ, LibriSpeech, and Switchboard data sets among other end-to-end systems without data augmentation, and is 4-11x faster for decoding than similar systems (e.g. ESPNET).
SNAS: Stochastic Neural Architecture Search
It is proved that this search gradient optimizes the same objective as reinforcement-learning-based NAS, but assigns credits to structural decisions more efficiently, and is further augmented with locally decomposable reward to enforce a resource-efficient constraint.
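For context, SNAS-style stochastic search relaxes the discrete operation choice with Gumbel-softmax (concrete) samples, so the sample itself stays differentiable in the architecture logits. A minimal sketch of that sampling step, assuming a single categorical architecture decision (hypothetical code, not from any of the listed papers):

```python
import numpy as np

def gumbel_softmax_sample(logits, tau=1.0, seed=None):
    """Draw a differentiable, approximately one-hot weight vector over
    candidate ops. Lower temperature tau -> closer to a discrete choice."""
    rng = np.random.default_rng(seed)
    # Gumbel(0, 1) noise via the inverse-CDF trick
    g = -np.log(-np.log(rng.uniform(1e-9, 1.0, size=logits.shape)))
    y = (logits + g) / tau
    e = np.exp(y - y.max())
    return e / e.sum()

# Hypothetical example: three candidate ops with learned logits
weights = gumbel_softmax_sample(np.array([0.5, 2.0, -1.0]), tau=0.5, seed=0)
# weights is a probability vector; in expectation the largest-logit op dominates
```

In a full SNAS-style search, these sampled weights would mix the candidate operations' outputs during supernet training, and annealing tau toward zero recovers a near-discrete architecture.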
Stochastic Adaptive Neural Architecture Search for Keyword Spotting
A new method called SANAS (Stochastic Adaptive Neural Architecture Search) is proposed which is able to adapt the architecture of the neural network on-the-fly at inference time, so that small architectures are used when the stream is easy to process (silence, low noise, …) and bigger networks are used when the task becomes more difficult.
Flat-Start Single-Stage Discriminatively Trained HMM-Based Models for ASR
This study investigates flat-start one-stage training of neural networks using the lattice-free maximum mutual information (LF-MMI) objective function with HMMs for large vocabulary continuous speech recognition, and proposes a standalone system which achieves word error rates comparable with those of state-of-the-art multi-stage systems while being much faster to prepare.
Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition
A pre-trained deep neural network hidden Markov model (DNN-HMM) hybrid architecture trains the DNN to produce a distribution over senones (tied triphone states) as its output, and is shown to significantly outperform conventional context-dependent Gaussian mixture model (GMM)-HMMs.
End-to-end Speech Recognition Using Lattice-free MMI
The work on end-to-end training of acoustic models using the lattice-free maximum mutual information (LF-MMI) objective function in the context of hidden Markov models shows that this approach can achieve comparable results to regular LF-MMI on well-known large vocabulary tasks.
Neural Architecture Search with Reinforcement Learning
This paper uses a recurrent network to generate the model descriptions of neural networks and trains this RNN with reinforcement learning to maximize the expected accuracy of the generated architectures on a validation set.
Fully Convolutional Speech Recognition
This paper presents an alternative approach based solely on convolutional neural networks, leveraging recent advances in acoustic models from the raw waveform and language modeling, trained end-to-end to predict characters from the raw waveform, removing the feature extraction step altogether.
Improving Keyword Spotting and Language Identification via Neural Architecture Search at Scale
This paper presents a novel Neural Architecture Search (NAS) framework to improve keyword spotting and spoken language identification models, and demonstrates that this approach can automatically design DNNs with an order of magnitude fewer parameters that achieve better performance than the current best models.