Low-Dimensional Bottleneck Features for On-Device Continuous Speech Recognition

@inproceedings{Ramsay2018LowDimensionalBF,
  title={Low-Dimensional Bottleneck Features for On-Device Continuous Speech Recognition},
  author={David B. Ramsay and Kevin Kilgour and Dominik Roblek and Matthew Sharifi},
  booktitle={Interspeech},
  year={2018}
}
Low power digital signal processors (DSPs) typically have a very limited amount of memory in which to cache data. In this paper we develop efficient bottleneck feature (BNF) extractors that can be run on a DSP, and retrain a baseline large-vocabulary continuous speech recognition (LVCSR) system to use these BNFs with only a minimal loss of accuracy. The small BNFs allow the DSP chip to cache more audio features while the main application processor is suspended, thereby reducing the overall… 
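
As a rough illustration of the idea (not the paper's exact architecture), a bottleneck feature extractor is simply a small network whose output dimension is deliberately tiny; the layer sizes and the 4-dimensional bottleneck below are assumptions for the sketch:

# Minimal sketch, assuming a small feed-forward extractor that maps stacked
# log-mel frames to a low-dimensional embedding cheap enough to cache on a DSP.
# All dimensions here are illustrative, not the paper's.
import torch
import torch.nn as nn

class BottleneckExtractor(nn.Module):
    def __init__(self, in_dim=40 * 3, bottleneck_dim=4):  # assumed sizes
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, bottleneck_dim),  # low-dimensional bottleneck output
        )

    def forward(self, frames):  # frames: (batch, in_dim)
        return self.net(frames)

extractor = BottleneckExtractor()
feats = extractor(torch.randn(16, 120))  # 16 stacked-frame windows
print(feats.shape)  # torch.Size([16, 4]) -- tiny features to cache

Because each cached feature is only a few values instead of a full log-mel frame, far more audio context fits in the DSP's memory budget while the application processor sleeps.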

On Training Targets and Activation Functions for Deep Representation Learning in Text-Dependent Speaker Verification

This paper systematically studies the impact of training targets, activation functions, and loss functions on the performance of TD-SV, and experimentally shows that GELU significantly reduces the error rates of TD-SV compared to sigmoid, irrespective of the training target.
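
In code, the swap the paper studies amounts to changing the hidden activation; the layer width below is an illustrative assumption:

import torch.nn as nn

# Hypothetical speaker-embedding hidden layer; 512 units is an assumption.
sigmoid_variant = nn.Sequential(nn.Linear(512, 512), nn.Sigmoid())
gelu_variant = nn.Sequential(nn.Linear(512, 512), nn.GELU())  # lower TD-SV error per the paper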

Feed-Forward Deep Neural Network (FFDNN)-Based Deep Features for Static Malware Detection

Portable executable header (PEH) information is commonly used as input for malware detection systems; machine learning (ML) and deep learning (DL) classifiers are trained and validated on it, with deep features extracted from the hidden layers of a feed-forward deep neural network (FFDNN).
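
A minimal sketch of what "deep features from hidden layers" means in practice, with hypothetical layer sizes (the paper's exact network is not reproduced here):

import torch
import torch.nn as nn

# Hypothetical FFDNN; the "deep features" are the activations of a hidden
# layer, which can then feed a separate downstream classifier.
ffdnn = nn.Sequential(
    nn.Linear(256, 128), nn.ReLU(),   # input dim 256 is an assumption
    nn.Linear(128, 64), nn.ReLU(),    # <- take features from this layer
    nn.Linear(64, 2),                 # benign vs. malware logits
)

x = torch.randn(8, 256)               # batch of PEH feature vectors
deep_features = ffdnn[:4](x)          # forward through the first two blocks
print(deep_features.shape)            # torch.Size([8, 64])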

A Fixed-Point Neural Network Architecture for Speech Applications on Resource Constrained Hardware

This paper designs low-cost neural network architectures for keyword detection and speech recognition, and presents techniques to reduce the memory requirement by scaling down the precision of weights and biases without compromising detection/recognition performance.
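
A minimal sketch of the precision-scaling idea, assuming a simple symmetric int8 fixed-point scheme (the paper's exact quantization recipe may differ):

import numpy as np

# Symmetric per-tensor int8 quantization of a weight matrix: store int8
# values plus one float scale instead of full-precision floats.
def quantize_int8(w):
    scale = np.max(np.abs(w)) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(64, 64).astype(np.float32)
q, s = quantize_int8(w)
print(np.abs(w - dequantize(q, s)).max())  # small per-weight quantization error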

Compression of End-to-End Models

This work explores the problem of compressing end-to-end models with the goal of satisfying device constraints without sacrificing model accuracy, and evaluates matrix factorization, knowledge distillation, and parameter sparsity to determine the most effective methods given constraints such as a parameter budget.
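
Of the three methods evaluated, matrix factorization is the easiest to sketch: replace a weight matrix with a low-rank product, trading parameters for a tunable rank (the rank and sizes below are assumptions):

import numpy as np

# Low-rank factorization W ~= U @ V via truncated SVD; the rank controls
# the compression/accuracy trade-off and must be tuned empirically.
rng = np.random.default_rng(0)
W = rng.standard_normal((512, 512)).astype(np.float32)

rank = 64
U_, S, Vt = np.linalg.svd(W, full_matrices=False)
U = U_[:, :rank] * S[:rank]          # (512, 64)
V = Vt[:rank, :]                     # (64, 512)

# Parameters drop from 512*512 to 2*512*64; accuracy must be re-checked.
print(W.size, U.size + V.size)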

Now Playing: Continuous low-power music recognition

A low-power music recognizer that automatically recognizes music without user interaction is presented; by running entirely on a mobile device it respects user privacy while passively recognizing a wide range of music.

On the compression of recurrent neural networks with an application to LVCSR acoustic modeling for embedded speech recognition

This work presents a technique for general recurrent model compression that jointly compresses both recurrent and non-recurrent inter-layer weight matrices, and finds that the proposed technique reduces the size of a Long Short-Term Memory (LSTM) acoustic model to a third of its original size with negligible loss in accuracy.
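
A sketch of the shared low-rank projection flavor of this idea, using PyTorch's built-in LSTM projection; the sizes are illustrative, not the paper's:

import torch.nn as nn

# Projecting the hidden state to a smaller dimension shrinks both the
# recurrent matrix and the next layer's input matrix at once.
full = nn.LSTM(input_size=256, hidden_size=1024, num_layers=2)
compressed = nn.LSTM(input_size=256, hidden_size=1024, proj_size=128,
                     num_layers=2)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(full), count(compressed))  # the projected model is much smaller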

Convolutive Bottleneck Network features for LVCSR

A Convolutive Bottleneck Network is proposed as an extension of the current state-of-the-art Universal Context Network, leading to a 5.5% relative reduction in WER compared to the Universal Context ANN baseline.

State-of-the-Art Speech Recognition with Sequence-to-Sequence Models

A variety of structural and optimization improvements to the Listen, Attend, and Spell model are explored, which significantly improve performance, and a multi-head attention architecture is introduced, which offers improvements over the commonly used single-head attention.
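
A minimal multi-head attention sketch in the LAS setting, using PyTorch's built-in module; the embedding size and head count are assumptions:

import torch
import torch.nn as nn

# Each of the 4 heads attends over the encoder states independently,
# giving the decoder several context views per step.
mha = nn.MultiheadAttention(embed_dim=256, num_heads=4, batch_first=True)

enc = torch.randn(2, 100, 256)   # encoder states ("listener" output)
dec = torch.randn(2, 10, 256)    # decoder queries ("speller" state)
context, weights = mha(query=dec, key=enc, value=enc)
print(context.shape)             # torch.Size([2, 10, 256])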

Lower Frame Rate Neural Network Acoustic Models

On a large vocabulary Voice Search task, it is shown that with conventional models, one can slow the frame rate to 40 ms while improving WER by 3% relative over a CTC-based model, thus improving overall system speed.
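
One common way to lower the frame rate is to stack adjacent frames and subsample, sketched below with an assumed 10 ms input frame rate and an 80-dimensional log-mel frontend (exact recipes vary by system):

import torch

# Stack 3 consecutive 10 ms frames, then keep every 4th stack, so each
# output frame covers wider context at roughly a 40 ms rate.
x = torch.randn(1, 400, 80)             # (batch, 10 ms frames, log-mel dims)
stacked = x.unfold(1, 3, 1).flatten(2)  # each output frame sees 3 inputs
lowrate = stacked[:, ::4, :]            # subsample to ~40 ms frame rate
print(lowrate.shape)                    # torch.Size([1, 100, 240])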

Listen, attend and spell: A neural network for large vocabulary conversational speech recognition

We present Listen, Attend and Spell (LAS), a neural speech recognizer that transcribes speech utterances directly to characters without pronunciation models, HMMs, or other components of traditional speech recognition systems.

Librispeech: An ASR corpus based on public domain audio books

It is shown that acoustic models trained on LibriSpeech give a lower error rate on the Wall Street Journal (WSJ) test sets than models trained on WSJ itself.

Connectionist Temporal Classification

Experiments on speech and handwriting recognition show that a BLSTM network with a CTC output layer is an effective sequence labeller, generally outperforming standard HMMs and HMM-neural network hybrids, as well as more recent sequence labelling algorithms such as large margin HMMs and conditional random fields.
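
A minimal usage sketch of a CTC loss (PyTorch's nn.CTCLoss here, not the original implementation): frame-level log-probabilities are scored against a shorter label sequence with no frame-level alignment; all sizes are illustrative:

import torch
import torch.nn as nn

# CTC marginalizes over all alignments between T frames and a shorter
# label sequence, using a reserved blank symbol (index 0 here).
ctc = nn.CTCLoss(blank=0)

T, B, C = 50, 2, 28                      # frames, batch, characters (+blank)
log_probs = torch.randn(T, B, C).log_softmax(2)
targets = torch.randint(1, C, (B, 10))   # label sequences (no blanks)
loss = ctc(log_probs, targets,
           input_lengths=torch.full((B,), T),
           target_lengths=torch.full((B,), 10))
print(loss.item())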