AST: Audio Spectrogram Transformer
- Yuan Gong, Yu-An Chung, James R. Glass
- Computer Science · Interspeech
- 5 April 2021
The Audio Spectrogram Transformer (AST) is introduced, the first convolution-free, purely attention-based model for audio classification, and an approach to transfer knowledge from ImageNet pretrained ViT to AST is proposed.
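The core input trick in AST is to treat a spectrogram like an image: cut it into fixed-size patches and flatten each into a token for a ViT-style encoder. A minimal numpy sketch of that patching step, assuming non-overlapping 16x16 patches (the paper itself uses overlapping patches; the function name and shapes here are illustrative, not the authors' code):

```python
import numpy as np

# Hypothetical sketch: a (n_mels, n_frames) log-mel spectrogram is cut into
# non-overlapping 16x16 patches, each flattened into a 256-dim vector; the
# resulting token sequence would then feed a ViT-style Transformer encoder.
def spectrogram_to_patches(spec, patch=16):
    n_mels, n_frames = spec.shape
    rows, cols = n_mels // patch, n_frames // patch
    spec = spec[:rows * patch, :cols * patch]          # drop ragged edges
    patches = (spec.reshape(rows, patch, cols, patch)
                   .transpose(0, 2, 1, 3)               # group by patch
                   .reshape(rows * cols, patch * patch))
    return patches                                      # (num_tokens, patch*patch)
```

For a 128-mel, 1024-frame input this yields 8 x 64 = 512 tokens of dimension 256, which is where the ImageNet-pretrained ViT weights can be transferred after adapting the patch embedding.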
Speech database development at MIT: Timit and beyond
- V. Zue, S. Seneff, James R. Glass
- Computer Science · Speech Communication
- 1 August 1990
Unsupervised spoken keyword spotting via segmental DTW on Gaussian posteriorgrams
- Yaodong Zhang, James R. Glass
- Computer Science, Economics · IEEE Workshop on Automatic Speech Recognition…
- 1 December 2009
An unsupervised learning framework is presented for detecting spoken keywords: segmental dynamic time warping compares Gaussian posteriorgrams between keyword samples and test utterances, and the resulting alignment scores yield the keyword detection result.
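The comparison above can be sketched in a few lines. This is a toy illustration, not the paper's implementation: it assumes the inner-product local distance commonly used with posteriorgrams, d(p, q) = -log(p · q), and plain (non-segmental) DTW for brevity:

```python
import numpy as np

# Sketch of DTW over two posteriorgram sequences (rows are posterior
# distributions over k Gaussian components). Assumed local distance:
# d(p, q) = -log(p . q), clipped to avoid log(0).
def posteriorgram_dtw(P, Q):
    n, m = len(P), len(Q)
    dist = -np.log(np.clip(P @ Q.T, 1e-12, None))   # local distance matrix
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i, j] = dist[i - 1, j - 1] + min(D[i - 1, j],
                                               D[i, j - 1],
                                               D[i - 1, j - 1])
    return D[n, m] / (n + m)    # length-normalized alignment cost
```

A low normalized cost between a keyword exemplar and a region of a test utterance signals a likely keyword hit; the segmental variant in the paper additionally restricts warping to local regions.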
An Unsupervised Autoregressive Model for Speech Representation Learning
- Yu-An Chung, Wei-Ning Hsu, Hao Tang, James R. Glass
- Computer Science · Interspeech
- 5 April 2019
Speech representations learned by the proposed unsupervised autoregressive neural model significantly improve performance on both phone classification and speaker verification over surface features and other supervised and unsupervised approaches.
Unsupervised Pattern Discovery in Speech
- A. Park, James R. Glass
- Computer Science · IEEE Transactions on Audio, Speech, and Language…
- 2008
It is shown how pattern discovery can be used to automatically acquire lexical entities directly from an untranscribed audio stream by exploiting the structure of repeating patterns within the speech signal.
Unsupervised Learning of Disentangled and Interpretable Representations from Sequential Data
- Wei-Ning Hsu, Yu Zhang, James R. Glass
- Computer Science · NIPS
- 22 September 2017
A factorized hierarchical variational autoencoder is proposed, which learns disentangled and interpretable representations from sequential data without supervision by formulating the model explicitly as a factorized hierarchical graphical model that imposes sequence-dependent and sequence-independent priors on different sets of latent variables.
A probabilistic framework for segment-based speech recognition
- James R. Glass
- Computer Science · Computer Speech and Language
- 1 April 2003
Highway long short-term memory RNNs for distant speech recognition
- Yu Zhang, Guoguo Chen, Dong Yu, K. Yao, S. Khudanpur, James R. Glass
- Computer Science · IEEE International Conference on Acoustics…
- 30 October 2015
This paper extends deep long short-term memory (DLSTM) recurrent neural networks by introducing gated direct connections between memory cells in adjacent layers, and introduces latency-controlled bidirectional LSTMs (BLSTMs), which can exploit the whole history while keeping the latency under control.
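The gated direct connection between adjacent layers' memory cells can be sketched as follows. This is a hedged illustration of the idea, not the paper's exact formulation: it assumes a "carry" gate d computed from the layer input, mixing the layer's own LSTM cell update with the cell state of the layer below:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical highway connection between memory cells in adjacent layers:
# the layer-l cell state adds the layer-(l-1) cell state, scaled by a
# learned carry gate d computed from the layer input x.
def highway_cell_update(c_new, c_below, x, W_d, b_d):
    d = sigmoid(W_d @ x + b_d)    # carry gate, one value per cell unit
    return c_new + d * c_below    # gated direct connection between cells
```

When the gate saturates near zero the layer behaves like a plain stacked LSTM; near one, gradients flow directly through the cell states across depth, which is what makes very deep stacks trainable.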
What do Neural Machine Translation Models Learn about Morphology?
- Yonatan Belinkov, Nadir Durrani, Fahim Dalvi, Hassan Sajjad, James R. Glass
- Computer Science · Annual Meeting of the Association for…
- 11 April 2017
This work analyzes the representations learned by neural MT models at various levels of granularity and empirically evaluates the quality of the representations for learning morphology through extrinsic part-of-speech and morphological tagging tasks.
Unsupervised Learning of Spoken Language with Visual Context
- David F. Harwath, A. Torralba, James R. Glass
- Computer Science · NIPS
- 2016
A deep neural network model capable of rudimentary spoken language acquisition using untranscribed audio training data, whose only supervision comes in the form of contextually relevant visual images, is presented.
...