Temporal Knowledge Distillation for on-device Audio Classification

  • Kwanghee Choi, Martin Kersner, Jacob Morton, Buru Chang
  • Published 27 October 2021
  • Computer Science
  • ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Improving the performance of on-device audio classification models remains a challenge given the computational limits of the mobile environment. Many studies leverage knowledge distillation to boost predictive performance by transferring the knowledge from large models to on-device models. However, most either lack a mechanism to distill the essence of the temporal information, which is crucial to audio classification tasks, or require a similar architecture. In this paper, we propose a new…

CMKD: CNN/Transformer-Based Cross-Model Knowledge Distillation for Audio Classification

An intriguing interaction is found between two very different models: CNN and AST models are good teachers for each other. When either of them is used as the teacher and the other model is trained as the student via knowledge distillation, the performance of the student model noticeably improves and, in many cases, is better than that of the teacher model.

Simple Pooling Front-ends For Efficient Audio Classification

Experimental results show that SimPFs can reduce the number of floating point operations (FLOPs) of off-the-shelf audio neural networks by more than half, with negligible degradation or even some improvement in audio classification performance.

Continual Learning for On-Device Environmental Sound Classification

Experimental results on the DCASE 2019 Task 1 and ESC-50 datasets show that the proposed continual learning method outperforms baseline continual learning methods on classification accuracy and computational efficiency, indicating that the method can efficiently and incrementally learn new classes without the catastrophic forgetting problem for on-device environmental sound classification.

Opening the Black Box of wav2vec Feature Encoder

This paper focuses on the convolutional feature encoder, whose latent space is often speculated to represent discrete acoustic units, and concludes that various information is embedded inside the feature encoding representations: fundamental frequency, formants, and amplitude, packed with sufficient temporal detail.

Distilling a Pretrained Language Model to a Multilingual ASR Model

A novel method called Distilling a Language model to a Speech model (Distill-L2S), which aligns the latent representations of two different modalities, is proposed, and its superiority is shown on 20 low-resource languages of the CommonVoice dataset with less than 100 hours of speech data.

Learning the Spectrogram Temporal Resolution for Audio Classification

Starting from a high-temporal-resolution spectrogram, such as one with a one-millisecond hop size, it is shown that DiffRes can improve classification accuracy at the same computational complexity.



Knowledge distillation for small-footprint highway networks

This paper significantly improved the recognition accuracy of the HDNN acoustic model with less than 0.8 million parameters, and narrowed the gap between this model and the plain DNN with 30 million parameters.

Temporal Convolution for Real-time Keyword Spotting on Mobile Devices

A temporal convolution for real-time KWS on mobile devices is proposed that exploits temporal convolutions with a compact ResNet architecture, achieves more than 385x speedup on Google Pixel 1, and surpasses the accuracy of the state-of-the-art model.

Attention is All you Need

A new simple network architecture, the Transformer, based solely on attention mechanisms and dispensing with recurrence and convolutions entirely, is proposed. It generalizes well to other tasks, as shown by applying it successfully to English constituency parsing with both large and limited training data.

Keyword Transformer: A Self-Attention Model for Keyword Spotting

The Keyword Transformer (KWT), a fully self-attentional architecture that exceeds state-of-the-art performance across multiple tasks without any pre-training or additional data, is introduced.

Intra-Utterance Similarity Preserving Knowledge Distillation for Audio Tagging

This novel KD method, "Intra-Utterance Similarity Preserving KD" (IUSP), shows promising results for the audio tagging task, with consistent improvements over similarity-preserving (SP) KD across student models of different sizes on the DCASE 2019 Task 5 dataset.

CoDERT: Distilling Encoder Representations with Co-learning for Transducer-based Speech Recognition

It is found that tandem training of teacher and student encoders with an inplace encoder distillation outperforms the use of a pre-trained and static teacher transducer.

Distilling the Knowledge in a Neural Network

This work shows that it can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model and introduces a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse.
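
The distillation idea summarized in the entry above is often written as a weighted sum of a hard-label loss and a soft-target loss computed on temperature-softened outputs. The block below is a minimal sketch of that common formulation, not the exact objective of any paper listed here. The function names, the KL form of the soft term, and the default values of `temperature` and `alpha` are assumptions made for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax of logits / temperature, computed in a numerically stable way."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, label,
                      temperature=2.0, alpha=0.5):
    """Hypothetical helper: alpha * hard loss + (1 - alpha) * T^2 * soft loss."""
    # Hard-label term: cross-entropy of the student at temperature 1.
    p_student = softmax(student_logits)
    hard = -math.log(p_student[label])

    # Soft-target term: KL(teacher || student) on temperature-softened outputs.
    q_teacher = softmax(teacher_logits, temperature)
    q_student = softmax(student_logits, temperature)
    soft = sum(t * math.log(t / s)
               for t, s in zip(q_teacher, q_student) if t > 0.0)

    # Scaling by T^2 is a common choice to keep the soft-term gradients
    # on a comparable scale when the temperature changes.
    return alpha * hard + (1.0 - alpha) * temperature ** 2 * soft


if __name__ == "__main__":
    # Toy logits for a three-class problem (made-up numbers).
    print(distillation_loss([0.5, 1.0, 2.0], [0.2, 0.8, 3.0], label=2))
```

With `alpha=0.0` only the soft term remains, so a student whose logits match the teacher's incurs zero loss under this sketch.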

Convolutional Recurrent Neural Networks for Small-Footprint Keyword Spotting

Systems and methods for creating and using Convolutional Recurrent Neural Networks (CRNNs) for small-footprint keyword spotting (KWS) are described; a CRNN model embodiment demonstrated high accuracy and robust performance in a wide range of environments.

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

A new language representation model, BERT, is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers; it can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.

Distilling the Knowledge of BERT for Sequence-to-Sequence ASR

This work leverages both left and right context by applying BERT as an external language model to seq2seq ASR through knowledge distillation, and outperforms other LM application approaches such as n-best rescoring and shallow fusion, while requiring no extra inference cost.