Efficient Neural Architecture Search for End-to-End Speech Recognition Via Straight-Through Gradients
@article{Zheng2021EfficientNA,
  title={Efficient Neural Architecture Search for End-to-End Speech Recognition Via Straight-Through Gradients},
  author={Huahuan Zheng and Keyu An and Zhijian Ou},
  journal={2021 IEEE Spoken Language Technology Workshop (SLT)},
  year={2021},
  pages={60-67}
}
Neural Architecture Search (NAS), the process of automating architecture engineering, is an appealing next step for advancing end-to-end Automatic Speech Recognition (ASR), replacing expert-designed networks with learned, task-specific architectures. In contrast to early, computationally demanding NAS methods, recent gradient-based NAS methods, e.g., DARTS (Differentiable ARchiTecture Search), SNAS (Stochastic NAS) and ProxylessNAS, significantly improve NAS efficiency. In this paper, we make…
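The straight-through idea in the title can be illustrated minimally: the forward pass makes a hard, discrete choice among candidate operations (so only one op is executed), while the backward pass computes gradients as if the soft softmax weights had been used. The following NumPy sketch uses hypothetical logits and op outputs purely for illustration; it is not the paper's actual training code.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Architecture logits over 3 candidate operations (hypothetical values).
alpha = np.array([0.5, 2.0, -1.0])
probs = softmax(alpha)

# Forward pass: hard, discrete selection -- only the argmax op is executed.
one_hot = np.zeros_like(alpha)
one_hot[np.argmax(alpha)] = 1.0
op_outputs = np.array([1.0, 3.0, 2.0])   # outputs of the candidate ops (made up)
y = one_hot @ op_outputs                 # only the selected op contributes

# Backward pass: straight-through -- pretend the forward pass had used the
# soft weights `probs`, and push the gradient through the softmax Jacobian.
dL_dy = 1.0                              # upstream gradient (assumed)
dL_dprobs = dL_dy * op_outputs
softmax_jac = np.diag(probs) - np.outer(probs, probs)
dL_dalpha = softmax_jac @ dL_dprobs      # gradient w.r.t. architecture logits
```

Note that the softmax Jacobian's rows sum to zero, so the straight-through gradient redistributes credit among the candidate operations rather than uniformly inflating all logits.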
9 Citations
EfficientTDNN: Efficient Architecture Search for Speaker Recognition
- Computer Science
- 2021
Comprehensive investigations suggest that the trained supernet generalizes to subnets not sampled during training and obtains a favorable trade-off between accuracy and efficiency.
Exploiting Single-Channel Speech for Multi-Channel End-to-End Speech Recognition: A Comparative Study
- Computer Science, ArXiv
- 2022
This paper systematically compares the performance of three schemes to exploit external single-channel data for multi-channel end-to-end ASR, namely back-end pre-training, data scheduling, and data simulation, under different settings such as the sizes of the single-channel data and the choices of the front-end.
Deformable TDNN with adaptive receptive fields for speech recognition
- Computer Science, Interspeech
- 2021
A latency control mechanism for deformable TDNNs is proposed, which enables deformable TDNNs to perform streaming ASR without accuracy degradation, and it is shown that deformable TDNNs obtain state-of-the-art results on WSJ benchmarks.
Advancing CTC-CRF Based End-to-End Speech Recognition with Wordpieces and Conformers
- Computer Science, ArXiv
- 2021
This paper investigates techniques to enable the recently developed wordpiece modeling units and Conformer neural networks to be successfully applied in CTC-CRFs, and suggests that the Conformer can improve recognition performance significantly.
Neural Architecture Search for Speech Emotion Recognition
- Computer Science, ArXiv
- 2022
To accelerate the candidate architecture optimization, a uniform path dropout strategy is proposed to encourage all candidate architecture operations to be equally optimized to improve SER performance.
Neural Architecture Search for LF-MMI Trained Time Delay Neural Networks
- Computer Science, ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2021
A range of neural architecture search techniques are used to automatically learn two types of hyper-parameters of state-of-the-art factored time delay neural networks (TDNNs): i) the left and right splicing context offsets; and ii) the dimensionality of the bottleneck linear projection at each hidden layer.
Efficient Gradient-Based Neural Architecture Search For End-to-End ASR
- Computer Science, ICMI Companion
- 2021
This work focuses on applying NAS to the most popular manually designed model, the Conformer, and proposes an efficient ASR model search method that benefits from the natural advantage of differentiable architecture search (DARTS) in reducing computational overhead.
Improved Conformer-based End-to-End Speech Recognition Using Neural Architecture Search
- Computer Science, ArXiv
- 2021
This work proposes a NAS-based ASR framework containing one search space and one differentiable search algorithm, Differentiable Architecture Search (DARTS); the search space follows the convolution-augmented transformer (Conformer) backbone, which is a more expressive ASR architecture than those used in existing NAS-based ASR frameworks.
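The DARTS relaxation mentioned above replaces the discrete choice of operation on each edge with a softmax-weighted mixture of all candidate ops, which makes the loss differentiable with respect to the architecture weights. A minimal sketch with hypothetical values:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Architecture weights for the candidate ops on one edge (hypothetical).
alpha = np.array([0.1, 1.5, -0.3])
# Outputs of the candidate ops for one input (hypothetical scalars).
op_outputs = np.array([0.2, 0.8, 0.5])

# DARTS mixed operation: a softmax-weighted sum over ALL candidate ops,
# so every op runs in the forward pass and gradients reach every alpha.
weights = softmax(alpha)
mixed = weights @ op_outputs
```

Running every candidate op is what makes vanilla DARTS memory-hungry; straight-through and ProxylessNAS-style methods avoid it by executing only the sampled op.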
Multilingual and Crosslingual Speech Recognition Using Phonological-Vector Based Phone Embeddings
- Linguistics, Computer Science, 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)
- 2021
This paper proposes to join phonology-driven phone embedding (top-down) and deep neural network (DNN) based acoustic feature extraction (bottom-up) to calculate phone probabilities, and introduces a new method called JoinAP (Joining of Acoustics and Phonology), in which no inversion from acoustics to phonological features is required for speech recognition.
References
Showing 1-10 of 37 references
Improving End-to-End Speech Recognition with Policy Learning
- Computer Science, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2018
It is shown that joint training improves relative performance by 4% to 13% for the end-to-end model compared to the same model trained through maximum likelihood, and that policy learning makes it possible to directly optimize the (otherwise non-differentiable) performance metric.
Espresso: A Fast End-to-End Neural Speech Recognition Toolkit
- Computer Science, 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)
- 2019
Espresso achieves state-of-the-art ASR performance on the WSJ, LibriSpeech, and Switchboard data sets among other end-to-end systems without data augmentation, and is 4-11x faster for decoding than similar systems (e.g., ESPnet).
SNAS: Stochastic Neural Architecture Search
- Computer Science, ICLR
- 2019
It is proved that this search gradient optimizes the same objective as reinforcement-learning-based NAS, but assigns credits to structural decisions more efficiently, and is further augmented with locally decomposable reward to enforce a resource-efficient constraint.
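The stochastic relaxation at the heart of SNAS can be sketched with a Gumbel-softmax sample: architecture choices are drawn as soft, differentiable approximations of one-hot categorical samples, so the search gradient can flow through the structural decision. The logits and temperature below are hypothetical, chosen only for illustration.

```python
import numpy as np

def gumbel_softmax(logits, tau, rng):
    # Sample Gumbel(0, 1) noise and apply the temperature-tau softmax
    # relaxation of a categorical sample over the logits.
    g = -np.log(-np.log(rng.uniform(size=logits.shape)))
    y = (logits + g) / tau
    e = np.exp(y - y.max())
    return e / e.sum()

rng = np.random.default_rng(0)
alpha = np.array([0.5, 2.0, -1.0])           # architecture logits (hypothetical)
z = gumbel_softmax(alpha, tau=0.5, rng=rng)  # soft, differentiable "sample"
# As tau -> 0, z approaches a one-hot sample; gradients flow through z
# because the sampling noise is separated from the learnable logits.
```

This reparameterization is why the search gradient can optimize the same expected-reward objective as RL-based NAS without a high-variance score-function estimator.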
Stochastic Adaptive Neural Architecture Search for Keyword Spotting
- Computer Science, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2019
A new method called SANAS (Stochastic Adaptive Neural Architecture Search) is proposed, which adapts the architecture of the neural network on the fly at inference time, so that small architectures are used when the stream is easy to process (silence, low noise, …) and bigger networks are used when the task becomes more difficult.
Flat-Start Single-Stage Discriminatively Trained HMM-Based Models for ASR
- Computer Science, IEEE/ACM Transactions on Audio, Speech, and Language Processing
- 2018
This study investigates flat-start single-stage training of neural networks using the lattice-free maximum mutual information (LF-MMI) objective function with HMMs for large-vocabulary continuous speech recognition, and proposes a standalone system that achieves word error rates comparable with those of state-of-the-art multi-stage systems while being much faster to prepare.
Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition
- Computer Science, IEEE Transactions on Audio, Speech, and Language Processing
- 2012
A pre-trained deep neural network hidden Markov model (DNN-HMM) hybrid architecture is proposed that trains the DNN to produce a distribution over senones (tied triphone states) as its output, and can significantly outperform conventional context-dependent Gaussian mixture model (GMM) HMMs.
End-to-end Speech Recognition Using Lattice-free MMI
- Computer Science, INTERSPEECH
- 2018
The work on end-to-end training of acoustic models using the lattice-free maximum mutual information (LF-MMI) objective function in the context of hidden Markov models shows that this approach can achieve comparable results to regular LF-MMI on well-known large-vocabulary tasks.
Neural Architecture Search with Reinforcement Learning
- Computer Science, ICLR
- 2017
This paper uses a recurrent network to generate the model descriptions of neural networks and trains this RNN with reinforcement learning to maximize the expected accuracy of the generated architectures on a validation set.
Fully Convolutional Speech Recognition
- Computer Science, ArXiv
- 2018
This paper presents an alternative approach based solely on convolutional neural networks, leveraging recent advances in acoustic models from the raw waveform and in language modeling; the system is trained end-to-end to predict characters from the raw waveform, removing the feature extraction step altogether.
Improving Keyword Spotting and Language Identification via Neural Architecture Search at Scale
- Computer Science, INTERSPEECH
- 2019
This paper presents a novel Neural Architecture Search (NAS) framework to improve keyword spotting and spoken language identification models, and demonstrates that this approach can automatically design DNNs with an order of magnitude fewer parameters that achieve better performance than the current best models.