Pyramidal Temporal Pooling With Discriminative Mapping for Audio Classification

Liwen Zhang, Ziqiang Shi, Jiqing Han. "Pyramidal Temporal Pooling With Discriminative Mapping for Audio Classification." IEEE/ACM Transactions on Audio, Speech, and Language Processing.
Audio signals are temporally structured data, and learning discriminative representations that contain temporal information is crucial for audio classification. In this article, we propose an audio representation learning method with a hierarchical pyramid structure, called pyramidal temporal pooling (PTP), which aims to capture the temporal information of an entire audio sample. By stacking a global temporal pooling layer on multiple local temporal pooling layers, the PTP can capture the…
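The pyramid structure described in the abstract can be sketched as follows, with plain average pooling standing in for the paper's learned pooling and discriminative mapping; the function names and window/hop values are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def local_temporal_pooling(frames, window=4, hop=2):
    """Average-pool frame-level features over overlapping local windows.
    frames: (T, D) array of T frame features of dimension D."""
    pooled = []
    for start in range(0, len(frames) - window + 1, hop):
        pooled.append(frames[start:start + window].mean(axis=0))
    return np.stack(pooled)

def pyramidal_temporal_pooling(frames, levels=2):
    """Stack several local pooling layers, then a global pooling layer
    on top, yielding one fixed-length vector for the whole clip.
    Assumes the clip is long enough for the requested number of levels."""
    seq = frames
    for _ in range(levels):
        seq = local_temporal_pooling(seq)
    return seq.mean(axis=0)  # global temporal pooling over the pyramid output

# e.g. 32 frames of 20-dim features -> a single 20-dim clip vector
clip = pyramidal_temporal_pooling(np.random.randn(32, 20))
```

Each level summarizes progressively longer temporal spans, so the final global pool sees a coarse-to-fine view of the clip rather than a single bag of frames.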
Learning Temporal Relations from Semantic Neighbors for Acoustic Scene Classification
This letter proposes an end-to-end 3D Convolutional Neural Network for ASC, named SeNoT-Net, which can generate effective audio representations by capturing temporal relations from semantic neighbors of different receptive fields over time.
ATReSN-Net: Capturing Attentive Temporal Relations in Semantic Neighborhood for Acoustic Scene Classification
This paper proposes a 3D CNN for ASC, named ATReSN-Net, which can capture temporal relations of different receptive fields from arbitrary time-frequency locations by mapping the semantic features obtained from the residual block into a semantic space.
Convolutional Receptive Field Dual Selection Mechanism for Acoustic Scene Classification
A convolutional receptive field dual selection mechanism (CRFDS) designed to replace the convolution layers of a CNN; it can find the optimal receptive fields in two dimensions simultaneously, improving the semantic feature extraction ability of CNNs on spectrograms.
CNN-Based Acoustic Scene Classification System
A more general classification model that combines harmonic-percussive source separation and deltas/delta-deltas features with four different models to develop a low-complexity system.
A Temporal-oriented Broadcast ResNet for COVID-19 Detection
TorNet achieves competitive results with higher computational efficiency than other state-of-the-art alternatives, reaching 72.2% Unweighted Average Recall on the INTERSPEECH 2021 Computational Paralinguistics Challenge COVID-19 Cough Sub-Challenge.
Audio Attacks and Defenses against AED Systems - A Practical Study
Tests the robustness of multiple security-critical AED tasks, implemented as CNN classifiers, as well as existing third-party Nest devices manufactured by Google, which run their own black-box deep learning models.
Deep Neural Decision Forest for Acoustic Scene Classification
This paper proposes a novel approach for ASC using a deep neural decision forest (DNDF), which combines a fixed number of convolutional layers with a decision forest as the final classifier; it improves ASC classification accuracy and shows competitive performance against state-of-the-art baselines.


Unsupervised Temporal Feature Learning Based on Sparse Coding Embedded BoAW for Acoustic Event Recognition
A novel unsupervised temporal feature learning method that effectively captures the temporal dynamics of an entire audio signal of arbitrary duration by building direct connections between the sequence of BoAW histograms and its time indexes using a non-linear support vector regression (SVR) model.
Large-Scale Weakly Supervised Audio Classification Using Gated Convolutional Neural Network
In this paper, we present a gated convolutional neural network and a temporal attention-based localization method for audio classification, which won 1st place in the large-scale weakly supervised sound event detection challenge.
AENet: Learning Deep Audio Features for Video Analysis
A convolutional neural network operating on a large temporal input enables an end-to-end audio event detection system; transfer-learning experiments show that the model learns generic audio features, similar to the way CNNs learn generic features on vision tasks.
Rank Pooling for Action Recognition
A function-based temporal pooling method that captures the latent structure of the video sequence data - e.g., how frame-level features evolve over time in a video - and is easy to interpret and implement, fast to compute and effective in recognizing a wide variety of actions.
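The function-based pooling idea above can be sketched with an ordinary least-squares fit standing in for the ranking machine used in the paper: fit each feature dimension as a linear function of time and keep the slope vector as the sequence representation. The smoothing step and function name are assumptions for illustration:

```python
import numpy as np

def rank_pool(frames):
    """Rank-pooling sketch: fit a linear function of time to the
    (running-average smoothed) frame features and return its slope
    vector, which encodes how features evolve over the sequence.
    The original method fits a ranking SVM/SVR instead of least squares."""
    T, D = frames.shape
    # running average over time, a common preprocessing in rank pooling
    smoothed = np.cumsum(frames, axis=0) / np.arange(1, T + 1)[:, None]
    t = np.arange(1, T + 1, dtype=float)
    A = np.stack([t, np.ones(T)], axis=1)  # design matrix: [time, intercept]
    # least-squares fit of every feature dimension against time
    coeffs, *_ = np.linalg.lstsq(A, smoothed, rcond=None)
    return coeffs[0]  # slope per dimension, shape (D,)
```

Because the output is a fixed-length parameter vector regardless of sequence length, it can be fed directly to any standard classifier.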
Audio concept classification with Hierarchical Deep Neural Networks
This paper explores, for the first time, the potential of deep learning for classifying audio concepts in user-generated content videos, with a proposed system comprising two cascaded neural networks in a hierarchical configuration that analyze short- and long-term context information.
A discriminative CNN video representation for event detection
This paper proposes using a set of latent concept descriptors as the frame descriptor, which enriches visual information while remaining computationally affordable, resulting in new state-of-the-art performance in event detection on the largest video datasets.
Time–Frequency Matrix Feature Extraction and Classification of Environmental Audio Signals
The results of the numerical simulation support the effectiveness of the proposed approach for environmental audio classification, with over 10% accuracy improvement compared to MFCC features.
Audio-Based Multimedia Event Detection with DNNs and Sparse Sampling
A sparse audio frame-sampling method that improves event-detection speed and accuracy, showing for the first time the potential of using only a DNN for audio-based multimedia event detection.
Improved audio features for large-scale multimedia event detection
While the overall finding is that MFCC features perform best, ANN as well as LSP features are found to provide complementary information at various levels of temporal resolution.
Bag-of-Audio-Words Approach for Multimedia Event Classification
Variations of the BoAW method are explored and results on NIST 2011 multimedia event detection (MED) dataset are presented.