ESPnet: End-to-End Speech Processing Toolkit
TLDR
A major architecture of this software platform, several important functionalities, which differentiate ESPnet from other open source ASR toolkits, and experimental results with major ASR benchmarks are explained.
Joint CTC-attention based end-to-end speech recognition using multi-task learning
TLDR
A novel method for end-to-end speech recognition to improve robustness and achieve fast convergence by using a joint CTC-attention model within the multi-task learning framework, thereby mitigating the alignment issue.
Hybrid CTC/Attention Architecture for End-to-End Speech Recognition
TLDR
The proposed hybrid CTC/attention end-to-end ASR is applied to two large-scale ASR benchmarks, and exhibits performance that is comparable to conventional DNN/HMM ASR systems based on the advantages of both multiobjective learning and joint decoding without linguistic resources.
A Comparative Study on Transformer vs RNN in Speech Applications
TLDR
An emergent sequence-to-sequence model called Transformer achieves state-of-the-art performance in neural machine translation and other natural language processing applications; this study reports the surprising superiority of Transformer over RNN in 13 of 15 ASR benchmarks.
Attention-Based Multimodal Fusion for Video Description
TLDR
A multimodal attention model is introduced that can selectively utilize features from different modalities for each word in the output description; it outperforms state-of-the-art methods on two standard datasets: YouTube2Text and MSR-VTT.
End-to-end Audio Visual Scene-aware Dialog Using Multimodal Attention-based Video Features
TLDR
This paper introduces a new data set of dialogs about videos of human behaviors, as well as an end-to-end Audio Visual Scene-Aware Dialog (AVSD) model, trained using this new data set, that generates responses in a dialog about a video.
Streaming Automatic Speech Recognition with the Transformer Model
TLDR
This work proposes a Transformer-based end-to-end ASR system for streaming ASR, where an output must be generated shortly after each spoken word, and applies time-restricted self-attention for the encoder and triggered attention for the encoder-decoder attention mechanism.
Advances in Joint CTC-Attention Based End-to-End Speech Recognition with a Deep CNN Encoder and RNN-LM
TLDR
This work learns to listen and write characters with a joint Connectionist Temporal Classification (CTC) and attention-based encoder-decoder network and beats out traditional hybrid ASR systems on spontaneous Japanese and Chinese speech.
Efficient WFST-Based One-Pass Decoding With On-The-Fly Hypothesis Rescoring in Extremely Large Vocabulary Continuous Speech Recognition
TLDR
This paper proposes a novel one-pass search algorithm with on-the-fly composition of weighted finite-state transducers (WFSTs) for large-vocabulary continuous-speech recognition and achieves high-accuracy one-pass real-time speech recognition with an extremely large vocabulary of 1.8 million words.
End-to-end Speech Recognition With Word-Based RNN Language Models
TLDR
A novel word-based RNN-LM is proposed, which allows decoding with only the word-based LM: it provides look-ahead word probabilities to predict the next characters in place of the character-based LM, leading to competitive accuracy with less computation than the multi-level LM.