Corpus ID: 236469219

Squeeze-Excitation Convolutional Recurrent Neural Networks for Audio-Visual Scene Classification

@article{NaranjoAlcazar2021SqueezeExcitationCR,
  title={Squeeze-Excitation Convolutional Recurrent Neural Networks for Audio-Visual Scene Classification},
  author={Javier Naranjo-Alcazar and Sergi Perez-Castanos and Aaron Lopez-Garcia and Pedro Zuccarello and Maximo Cobos and Francesc J. Ferri},
  journal={ArXiv},
  year={2021},
  volume={abs/2107.13180}
}
The use of multiple, semantically correlated sources can provide complementary information that may not be evident when working with individual modalities on their own. In this context, multi-modal models can help produce more accurate and robust predictions in machine learning tasks where audio-visual data is available. This paper presents a multi-modal model for automatic scene classification that simultaneously exploits auditory and visual information. The proposed…

References

CNN depth analysis with different channel inputs for Acoustic Scene Classification.
TLDR
Different log-Mel representations and combinations are analyzed, and the geometric mean, the arithmetic mean, and the Ordered Weighted Averaging (OWA) operator are studied as aggregation operators for the outputs of the different models of the ensemble.
TASK 1 DCASE 2020: ASC WITH MISMATCH DEVICES AND REDUCED SIZE MODEL USING RESIDUAL SQUEEZE-EXCITATION CNNS Technical Report
Acoustic Scene Classification (ASC) is a problem related to the field of machine listening whose objective is to classify/tag an audio clip with a predefined label describing a scene location such as…
A Review of Deep Learning Based Methods for Acoustic Scene Classification
TLDR
This article summarizes and groups existing approaches for data preparation, i.e., feature representations, feature pre-processing, and data augmentation, and for data modeling, i.e., …
Squeeze-and-Excitation Networks
TLDR
This work proposes a novel architectural unit, termed the “Squeeze-and-Excitation” (SE) block, that adaptively recalibrates channel-wise feature responses by explicitly modelling interdependencies between channels, and shows that these blocks can be stacked together to form SENet architectures that generalise extremely effectively across different datasets.
Deep Convolutional Neural Network with Mixup for Environmental Sound Classification
TLDR
A novel deep convolutional neural network is proposed for environmental sound classification (ESC) tasks; it uses stacked convolutional and pooling layers to extract high-level feature representations from spectrogram-like features.
Very Deep Convolutional Networks for Large-Scale Image Recognition
TLDR
This work investigates the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting using an architecture with very small convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers.
Concurrent Spatial and Channel Squeeze & Excitation in Fully Convolutional Networks
TLDR
This paper introduces three variants of SE modules for image segmentation, effectively incorporates these SE modules within three different state-of-the-art F-CNNs (DenseNet, SD-Net, U-Net), and observes consistent improvement of performance across all architectures while minimally affecting model complexity.
Deep Residual Learning for Image Recognition
TLDR
This work presents a residual learning framework to ease the training of networks that are substantially deeper than those used previously, and provides comprehensive empirical evidence showing that these residual networks are easier to optimize and can gain accuracy from considerably increased depth.
Anomalous Sound Detection using unsupervised and semi-supervised autoencoders and gammatone audio representation
TLDR
A novel framework based on convolutional autoencoders (both unsupervised and semi-supervised) and a Gammatone-based representation of the audio is proposed, and the results obtained substantially exceed the baseline results.
Places: A 10 Million Image Database for Scene Recognition
TLDR
The Places Database is described: a repository of 10 million scene photographs, labeled with scene semantic categories and comprising a large and diverse list of the types of environments encountered in the world. State-of-the-art Convolutional Neural Networks are used as baselines and significantly outperform the previous approaches.
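
The "Squeeze-and-Excitation Networks" entry in this list describes a block that recalibrates channel-wise feature responses. As a rough illustration only, the sketch below shows one common way to write that idea in plain Python: global average pooling per channel (squeeze), a small two-layer gating network with ReLU and sigmoid (excitation), then per-channel scaling. The function name `se_block`, the weight names `w1`, `b1`, `w2`, `b2`, and the toy values are assumptions made for this example; they are not taken from the paper listed here or from its references, and the sketch is not the authors' implementation.

```python
import math


def se_block(features, w1, b1, w2, b2):
    """Hypothetical squeeze-and-excitation sketch on nested lists.

    features: list of C channels, each a list of H rows of W floats.
    w1, b1:   weights/biases of the bottleneck layer (hidden_size x C).
    w2, b2:   weights/biases of the gating layer (C x hidden_size).
    Returns (recalibrated_features, gates).
    """
    # Squeeze: global average pooling gives one descriptor per channel.
    squeezed = [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in features
    ]
    # Excitation, step 1: bottleneck fully connected layer followed by ReLU.
    hidden = [
        max(0.0, sum(w * z for w, z in zip(weights, squeezed)) + bias)
        for weights, bias in zip(w1, b1)
    ]
    # Excitation, step 2: fully connected layer followed by a sigmoid,
    # producing one gate in (0, 1) per channel.
    gates = [
        1.0 / (1.0 + math.exp(-(sum(w * h for w, h in zip(weights, hidden)) + bias)))
        for weights, bias in zip(w2, b2)
    ]
    # Scale: multiply every value of a channel by that channel's gate.
    recalibrated = [
        [[gate * value for value in row] for row in channel]
        for channel, gate in zip(features, gates)
    ]
    return recalibrated, gates


# Toy input: 2 channels of size 2x2 and a bottleneck with a single hidden unit.
features = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[-1.0, 0.0], [1.0, 0.0]],
]
w1, b1 = [[0.5, -0.5]], [0.0]
w2, b2 = [[1.0], [-1.0]], [0.0, 0.0]

recalibrated, gates = se_block(features, w1, b1, w2, b2)
print(gates)
```

In this toy setup the first channel has the larger average activation, so its gate ends up above 0.5 and the second channel's gate below 0.5; in a trained network the weights would be learned rather than fixed by hand.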