PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition

@article{Kong2020PANNsLP,
  title={PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition},
  author={Qiuqiang Kong and Yin Cao and Turab Iqbal and Yuxuan Wang and Wenwu Wang and Mark D. Plumbley},
  journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing},
  year={2020},
  volume={28},
  pages={2880-2894}
}
Audio pattern recognition is an important research topic in the machine learning area, and includes several tasks such as audio tagging, acoustic scene classification, music classification, speech emotion classification and sound event detection. Recently, neural networks have been applied to tackle audio pattern recognition problems. However, previous systems are built on specific datasets with limited durations. Recently, in computer vision and natural language processing, systems pretrained… 
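As a rough illustration of the PANNs recipe (a log-mel spectrogram frontend feeding a CNN trained for multi-label audio tagging on AudioSet's 527 classes), the sketch below builds a deliberately small stand-in model, not the paper's CNN14; the 32 kHz sample rate and 64 mel bins roughly follow the paper's setup, but the layer sizes are assumptions chosen only for illustration.

```python
# Minimal sketch of a PANNs-style audio tagger: log-mel frontend + small CNN.
# The network is a toy stand-in, not the paper's CNN14; layer sizes are illustrative.
import torch
import torch.nn as nn
import torchaudio

class TinyAudioTagger(nn.Module):
    def __init__(self, num_classes=527, sample_rate=32000, n_mels=64):
        super().__init__()
        # Waveform -> log-mel spectrogram frontend.
        self.melspec = torchaudio.transforms.MelSpectrogram(
            sample_rate=sample_rate, n_fft=1024, hop_length=320, n_mels=n_mels)
        self.to_db = torchaudio.transforms.AmplitudeToDB()
        # A few conv blocks, then global pooling and a multi-label classifier head.
        self.conv = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.BatchNorm2d(128), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.head = nn.Linear(128, num_classes)

    def forward(self, waveform):                   # waveform: (batch, samples)
        x = self.to_db(self.melspec(waveform))     # (batch, n_mels, frames)
        x = x.unsqueeze(1)                         # add channel dimension
        x = self.conv(x)
        x = x.mean(dim=[2, 3])                     # global average pooling
        return torch.sigmoid(self.head(x))         # per-class tag probabilities

tagger = TinyAudioTagger()
probs = tagger(torch.randn(2, 32000))              # two 1-second clips at 32 kHz
print(probs.shape)                                 # torch.Size([2, 527])
```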
PaddleSpeech: An Easy-to-Use All-in-One Speech Toolkit
TLDR
The design philosophy and core architecture of PaddleSpeech are described; the toolkit supports several essential speech-to-text and text-to-speech tasks and achieves competitive or state-of-the-art performance on various speech datasets.
Wider or Deeper Neural Network Architecture for Acoustic Scene Classification with Mismatched Recording Devices
TLDR
A robust and low-complexity system for Acoustic Scene Classification (ASC), the task of identifying the scene of an audio recording, in which a novel inception-residual-based network architecture is proposed to deal with the mismatched recording device issue.
HTS-AT: A Hierarchical Token-Semantic Audio Transformer for Sound Classification and Detection
TLDR
HTS-AT is introduced: an audio transformer with a hierarchical structure that reduces model size and training time, further combined with a token-semantic module that maps the final outputs to class feature maps, enabling audio event detection and localization in time.
Multimodal Emotion Recognition and Sentiment Analysis via Attention Enhanced Recurrent Model
TLDR
The proposed method achieves CCCs of 0.4117 and 0.6649 for arousal and valence respectively on the MuSe-Wilder test set, outperforming the baseline system by a large margin and ranking in the top 3 in both sub-challenges.
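For reference, the concordance correlation coefficient (CCC) quoted above is a standard agreement measure between predicted and reference emotion traces; a minimal NumPy sketch of the formula:

```python
# Concordance correlation coefficient (CCC):
# CCC = 2*cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))**2)
import numpy as np

def ccc(pred, target):
    pred, target = np.asarray(pred, float), np.asarray(target, float)
    cov = np.mean((pred - pred.mean()) * (target - target.mean()))
    return 2 * cov / (pred.var() + target.var() + (pred.mean() - target.mean()) ** 2)

print(ccc([0.1, 0.4, 0.6], [0.2, 0.5, 0.7]))  # close to 1 for well-aligned predictions
```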
Multi-Attentive Detection of the Spider Monkey Whinny in the (Actual) Wild
TLDR
This work proposes an improvement based on Squeeze-and-Excitation mechanisms upon a recently proposed audio tagging ResNet, and shows that it performs significantly better than the baseline, as well as a collection of other recent audio models.
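A Squeeze-and-Excitation block of the kind referenced above is a small channel-attention module; the PyTorch sketch below shows the generic recipe only (the reduction ratio of 16 is the common default, not necessarily this paper's choice).

```python
# Generic Squeeze-and-Excitation block: squeeze spatial dimensions into a channel
# descriptor, excite it with a small bottleneck MLP, then rescale the channels.
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):                      # x: (batch, channels, H, W)
        w = x.mean(dim=[2, 3])                 # squeeze: global average pooling
        w = self.fc(w)                         # excitation: per-channel weights in [0, 1]
        return x * w[:, :, None, None]         # rescale feature maps channel-wise

print(SEBlock(64)(torch.randn(1, 64, 8, 8)).shape)  # torch.Size([1, 64, 8, 8])
```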
Multimodal Self-Supervised Learning of General Audio Representations
TLDR
This work demonstrates that their contrastive framework does not require high resolution images to learn good audio features, and is advantageous on a broad range of non-semantic audio tasks, including speaker identification, keyword spotting, language identification, and music instrument classification.
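The contrastive objective behind such multimodal frameworks is typically an InfoNCE-style loss over paired embeddings (here audio and image); the sketch below shows the generic form under that assumption, not the paper's exact implementation.

```python
# Generic InfoNCE-style contrastive loss between paired audio/image embeddings.
# Matching pairs lie on the diagonal of the similarity matrix; the rest of the
# batch serves as negatives.
import torch
import torch.nn.functional as F

def info_nce(audio_emb, image_emb, temperature=0.1):
    a = F.normalize(audio_emb, dim=-1)
    v = F.normalize(image_emb, dim=-1)
    logits = a @ v.t() / temperature          # (batch, batch) cosine similarities
    targets = torch.arange(a.size(0))         # positives are the diagonal entries
    return F.cross_entropy(logits, targets)

print(info_nce(torch.randn(8, 128), torch.randn(8, 128)).item())
```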
The Influence of Audio on Video Memorability with an Audio Gestalt Regulated Video Memorability System
TLDR
A novel multimodal deep learning-based late-fusion system that uses audio gestalt to estimate the influence of a given video’s audio on its overall short-term recognition memorability, and selectively leverages audio features to make a prediction accordingly.
Perceiver: General Perception with Iterative Attention
TLDR
This paper introduces the Perceiver – a model that builds upon Transformers and hence makes few architectural assumptions about the relationship between its inputs, but that also scales to hundreds of thousands of inputs, like ConvNets.
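The trick that lets the Perceiver scale is cross-attending from a small learned latent array to the (possibly very long) input array, so attention cost grows linearly rather than quadratically in the input length; a minimal sketch with PyTorch's built-in attention, with sizes chosen only for illustration.

```python
# Perceiver-style bottleneck: a small set of learned latents cross-attends to a
# long input sequence, so cost is O(num_latents * num_inputs) instead of quadratic.
import torch
import torch.nn as nn

class LatentCrossAttention(nn.Module):
    def __init__(self, dim=256, num_latents=64, num_heads=4):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, inputs):                                   # inputs: (batch, n_inputs, dim)
        q = self.latents.unsqueeze(0).expand(inputs.size(0), -1, -1)
        out, _ = self.attn(query=q, key=inputs, value=inputs)
        return out                                               # (batch, num_latents, dim)

x = torch.randn(2, 10000, 256)                                   # e.g. 10k input elements
print(LatentCrossAttention()(x).shape)                           # torch.Size([2, 64, 256])
```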
LEAF: A Learnable Frontend for Audio Classification
TLDR
This work introduces a new principled, lightweight, fully learnable architecture that can be used as a drop-in replacement of mel-filterbanks, and outperforms the current state-of-the-art learnable frontend on Audioset, with orders of magnitude fewer parameters.
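The idea of a learnable drop-in replacement for mel-filterbanks can be sketched, in a much-simplified form, as a bank of learned 1-D filters followed by pooling and log compression; this is not LEAF's Gabor-based parameterization, only an illustration of the frontend slot it fills.

```python
# Toy learnable frontend: learned 1-D filters, pooling, and log compression produce a
# spectrogram-like (channels, frames) map. LEAF itself uses constrained Gabor filters
# and learnable per-channel compression; this is a simplified illustration only.
import torch
import torch.nn as nn

class ToyLearnableFrontend(nn.Module):
    def __init__(self, n_filters=40, filter_len=401, hop=160):
        super().__init__()
        self.filters = nn.Conv1d(1, n_filters, filter_len, stride=1,
                                 padding=filter_len // 2, bias=False)
        self.pool = nn.AvgPool1d(kernel_size=400, stride=hop)   # ~25 ms window, 10 ms hop at 16 kHz

    def forward(self, wav):                       # wav: (batch, samples)
        x = self.filters(wav.unsqueeze(1)).abs()  # filterbank energies
        x = self.pool(x)
        return torch.log1p(x)                     # log compression

feats = ToyLearnableFrontend()(torch.randn(2, 16000))
print(feats.shape)                                # (2, 40, frames)
```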
CL4AC: A Contrastive Loss for Audio Captioning
TLDR
In CL4AC, self-supervision signals derived from the original audio-text paired data are used to exploit the correspondences between audio and text by contrasting samples, which improves the quality of the latent representation and the alignment between audio and text, even when trained with limited data.
...
...

References

Showing 10 of 61 references.
MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications
TLDR
This work introduces two simple global hyper-parameters that efficiently trade off between latency and accuracy, and demonstrates the effectiveness of MobileNets across a wide range of applications and use cases including object detection, fine-grained classification, face attributes and large-scale geo-localization.
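MobileNets' central building block, the depthwise separable convolution, factorizes a standard convolution into a per-channel spatial filter plus a 1x1 pointwise channel mix; a minimal PyTorch sketch:

```python
# Depthwise separable convolution: a per-channel (depthwise) 3x3 convolution followed
# by a 1x1 (pointwise) convolution that mixes channels, which is substantially cheaper
# than a full 3x3 convolution at the same width.
import torch
import torch.nn as nn

def depthwise_separable(in_ch, out_ch, stride=1):
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, 3, stride=stride, padding=1, groups=in_ch, bias=False),
        nn.BatchNorm2d(in_ch), nn.ReLU(inplace=True),
        nn.Conv2d(in_ch, out_ch, 1, bias=False),
        nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
    )

block = depthwise_separable(32, 64, stride=2)
print(block(torch.randn(1, 32, 56, 56)).shape)   # torch.Size([1, 64, 28, 28])
```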
Sample-level Deep Convolutional Neural Networks for Music Auto-tagging Using Raw Waveforms
TLDR
The experiments show how deep architectures with sample-level filters improve accuracy in music auto-tagging, with results comparable to previous state-of-the-art performance on the MagnaTagATune dataset and the Million Song Dataset.
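Sample-level CNNs of this kind stack many small 1-D convolutions directly on the raw waveform instead of hand-crafted spectrograms; the sketch below uses a short stack of kernel-3, stride-3 layers in that spirit, with depth and channel widths chosen for illustration rather than matching the paper's exact configuration.

```python
# Sketch of a sample-level 1-D CNN on raw waveforms: repeated small (kernel 3, stride 3)
# convolutions downsample the signal while learning filters end-to-end.
# Depth and channel widths are illustrative only.
import torch
import torch.nn as nn

def sample_level_cnn(num_tags=50):
    layers, ch = [], 1
    for out_ch in [64, 64, 128, 128, 256]:
        layers += [nn.Conv1d(ch, out_ch, kernel_size=3, stride=3),
                   nn.BatchNorm1d(out_ch), nn.ReLU()]
        ch = out_ch
    return nn.Sequential(*layers,
                         nn.AdaptiveAvgPool1d(1), nn.Flatten(),
                         nn.Linear(ch, num_tags), nn.Sigmoid())

model = sample_level_cnn()
tags = model(torch.randn(2, 1, 59049))    # 59049 samples, roughly 2.7 s at 22 kHz
print(tags.shape)                          # torch.Size([2, 50])
```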
Very deep convolutional neural networks for raw waveforms
TLDR
This work proposes very deep convolutional neural networks that directly use time-domain waveforms as inputs that are efficient to optimize over very long sequences, necessary for processing acoustic waveforms.
SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition
TLDR
This work presents SpecAugment, a simple data augmentation method for speech recognition that is applied directly to the feature inputs of a neural network (i.e., filter bank coefficients) and achieves state-of-the-art performance on the LibriSpeech 960h and Switchboard 300h tasks, outperforming all prior work.
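The masking part of SpecAugment is easy to reproduce with torchaudio's masking transforms, which zero out random frequency and time bands of the feature input; the mask widths below are placeholders rather than the paper's policies, and the paper's time-warping step is omitted here.

```python
# SpecAugment-style masking applied to a log-mel feature map: random frequency bands
# and time spans are zeroed out during training. Mask widths are placeholder values.
import torch
import torchaudio

freq_mask = torchaudio.transforms.FrequencyMasking(freq_mask_param=15)
time_mask = torchaudio.transforms.TimeMasking(time_mask_param=35)

features = torch.randn(1, 80, 300)          # (batch, mel bins, frames)
augmented = time_mask(freq_mask(features))  # apply one frequency and one time mask
print(augmented.shape)                      # same shape, masked contents
```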
MobileNetV2: Inverted Residuals and Linear Bottlenecks
TLDR
A new mobile architecture, MobileNetV2, is described that improves the state-of-the-art performance of mobile models on multiple tasks and benchmarks and across a spectrum of model sizes, and allows decoupling of the input/output domains from the expressiveness of the transformation.
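The inverted residual with linear bottleneck named in the title can be sketched as expand (1x1) -> depthwise 3x3 -> project (1x1 with no activation), with a skip connection when input and output shapes match; the expansion factor of 6 below is the usual default, assumed for illustration.

```python
# Inverted residual block with linear bottleneck: a 1x1 conv expands channels, a
# depthwise 3x3 conv filters spatially, and a 1x1 conv projects back down with no
# non-linearity. The residual is used only when shapes match.
import torch
import torch.nn as nn

class InvertedResidual(nn.Module):
    def __init__(self, in_ch, out_ch, stride=1, expand=6):
        super().__init__()
        hidden = in_ch * expand
        self.use_skip = stride == 1 and in_ch == out_ch
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, hidden, 1, bias=False), nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, hidden, 3, stride=stride, padding=1, groups=hidden, bias=False),
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, out_ch, 1, bias=False), nn.BatchNorm2d(out_ch),  # linear bottleneck
        )

    def forward(self, x):
        return x + self.block(x) if self.use_skip else self.block(x)

print(InvertedResidual(32, 32)(torch.randn(1, 32, 28, 28)).shape)  # torch.Size([1, 32, 28, 28])
```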
Audio Set: An ontology and human-labeled dataset for audio events
TLDR
The creation of Audio Set is described, a large-scale dataset of manually-annotated audio events that endeavors to bridge the gap in data availability between image and audio research and substantially stimulate the development of high-performance audio event recognizers.
CNN architectures for large-scale audio classification
TLDR
This work uses various CNN architectures to classify the soundtracks of a dataset of 70M training videos with 30,871 video-level labels, and investigates varying the size of both training set and label vocabulary, finding that analogs of the CNNs used in image classification do well on the authors' audio classification task, and larger training and label sets help up to a point.
Automatic Tagging Using Deep Convolutional Neural Networks
TLDR
The experiments show that mel-spectrogram is an effective time-frequency representation for automatic tagging and that more complex models benefit from more training data.
Adam: A Method for Stochastic Optimization
TLDR
This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
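The Adam update itself is compact enough to spell out; the NumPy sketch below uses the paper's default hyper-parameters.

```python
# One Adam step: exponential moving averages of the gradient (m) and its square (v),
# bias-corrected, then a per-parameter scaled update. Defaults follow the paper.
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad              # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2         # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                    # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

theta, m, v = np.zeros(3), np.zeros(3), np.zeros(3)
theta, m, v = adam_step(theta, np.array([0.1, -0.2, 0.3]), m, v, t=1)
print(theta)
```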
Audio tagging system for DCASE 2018: focusing on label noise, data augmentation and its efficient learning (2018)
...
...