AENet: Learning Deep Audio Features for Video Analysis

@article{Takahashi2018AENetLD,
  title={AENet: Learning Deep Audio Features for Video Analysis},
  author={Naoya Takahashi and Michael Gygli and Luc Van Gool},
  journal={IEEE Transactions on Multimedia},
  year={2018},
  volume={20},
  pages={513-524}
}
We propose a new deep network for audio event recognition, called AENet. In contrast to speech, sounds coming from audio events may be produced by a wide variety of sources. Furthermore, distinguishing them often requires analyzing an extended time period due to the lack of clear subword units that are present in speech. In order to incorporate this long-time frequency structure of audio events, we introduce a convolutional neural network (CNN) operating on a large temporal input. In contrast… 
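The truncated abstract describes a CNN operating on a large temporal input. As a rough, hypothetical sketch of that idea in PyTorch (layer counts, channel sizes, and the class count are illustrative assumptions, not the published AENet architecture):

import torch
import torch.nn as nn

class LargeContextAudioCNN(nn.Module):
    """Hypothetical sketch of a CNN over a long spectrogram context.
    Sizes are illustrative assumptions, not the published AENet."""
    def __init__(self, n_mels=64, n_classes=28):
        super().__init__()
        self.features = nn.Sequential(
            # Treat the log-mel spectrogram as a 1-channel image:
            # (batch, 1, n_mels, time). A long time axis gives the
            # large temporal input field the abstract refers to.
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # pool over frequency and time
        )
        self.classifier = nn.Linear(128, n_classes)

    def forward(self, x):
        h = self.features(x).flatten(1)
        return self.classifier(h)

# Roughly 4 seconds of context at ~100 frames/s: a 400-frame input.
model = LargeContextAudioCNN()
logits = model(torch.randn(8, 1, 64, 400))  # (batch, classes)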
AReN: A Deep Learning Approach for Sound Event Recognition Using a Brain Inspired Representation
TLDR
A deep learning method is proposed to automatically recognize events of interest in audio surveillance (namely screams, breaking glass, and gunshots); it outperforms existing methods in terms of recognition rate.
Audio-Visual Model Distillation Using Acoustic Images
TLDR
This paper exploits a new multimodal labeled action recognition dataset acquired by a hybrid audio-visual sensor that provides RGB video, raw audio signals, and spatialized acoustic data, also known as acoustic images, where the visual and acoustic images are aligned in space and synchronized in time.
SoReNet: a novel deep network for audio surveillance applications
TLDR
A method for automatically analyzing an audio stream for surveillance purposes: it detects abnormal events such as screams, gunshots, and breaking glass using a convolutional neural network.
Learning and Fusing Multimodal Deep Features for Acoustic Scene Categorization
TLDR
A novel acoustic scene classification system based on multimodal deep feature fusion is proposed, in which three CNNs perform 1D raw-waveform modeling, 2D time-frequency image modeling, and 3D spatio-temporal dynamics modeling, respectively (see the sketch below).
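As a loose illustration of that three-branch fusion idea, here is a minimal PyTorch sketch; all branch architectures, embedding sizes, and input shapes are toy assumptions rather than the paper's networks:

import torch
import torch.nn as nn

class ThreeBranchFusion(nn.Module):
    """Illustrative late feature fusion across a 1D waveform branch,
    a 2D spectrogram branch, and a 3D spatio-temporal branch.
    All shapes and sizes are assumptions, not the paper's network."""
    def __init__(self, n_classes=10, d=128):
        super().__init__()
        self.wave = nn.Sequential(  # 1D: (batch, 1, samples)
            nn.Conv1d(1, 32, 64, stride=16), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(32, d))
        self.spec = nn.Sequential(  # 2D: (batch, 1, mels, frames)
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, d))
        self.vol = nn.Sequential(   # 3D: (batch, 1, depth, H, W)
            nn.Conv3d(1, 16, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1), nn.Flatten(), nn.Linear(16, d))
        self.head = nn.Linear(3 * d, n_classes)

    def forward(self, wave, spec, vol):
        # Concatenate the per-branch embeddings, then classify.
        z = torch.cat([self.wave(wave), self.spec(spec), self.vol(vol)], dim=1)
        return self.head(z)

model = ThreeBranchFusion()
out = model(torch.randn(4, 1, 16000),
            torch.randn(4, 1, 64, 100),
            torch.randn(4, 1, 8, 32, 32))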
An End-to-End Audio Classification System based on Raw Waveforms and Mix-Training Strategy
TLDR
An end-to-end audio classification system based on raw waveforms and a mix-training strategy that breaks the performance limitation imposed by the amount of training data and exceeds the state-of-the-art multi-level attention model.
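The "mix-training strategy" suggests blending training examples; a generic mixup-style sketch follows (the paper's exact scheme may differ, and the alpha value, shapes, and one-hot label encoding are assumptions):

import torch

def mix_batch(x, y, alpha=0.2):
    """Mixup-style mixing: blend pairs of waveforms and their labels.
    A generic illustration of mix training; the paper's exact
    strategy may differ. y is assumed one-hot / multi-hot."""
    lam = torch.distributions.Beta(alpha, alpha).sample()
    idx = torch.randperm(x.size(0))
    x_mix = lam * x + (1.0 - lam) * x[idx]
    y_mix = lam * y + (1.0 - lam) * y[idx]
    return x_mix, y_mix

waves = torch.randn(16, 16000)                      # raw waveforms
labels = torch.eye(10)[torch.randint(0, 10, (16,))]  # one-hot labels
mixed_waves, mixed_labels = mix_batch(waves, labels)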
Deep Learning Frameworks Applied For Audio-Visual Scene Classification
In this paper, we present deep learning frameworks for audio-visual scene classification (SC) and indicate how individual visual and audio features, as well as their combination, affect SC performance.
End-to-End Audiovisual Speech Recognition System With Multitask Learning
TLDR
A novel end-to-end, multitask-learning (MTL) audiovisual ASR (AV-ASR) system is presented that considers the temporal dynamics within and across modalities, providing an appealing and practical fusion scheme.
Large scale video classification using both visual and audio features on YouTube-8M dataset
TLDR
This paper explores several models combining video-level visual and audio features that provide a promising classifier for the YouTube-8M Kaggle challenge, a video classification task over a dataset of 7 million YouTube videos belonging to 4716 classes.
A study on transfer learning for acoustic event detection in a real life scenario
TLDR
Experiments on transfer from a synthetic source database to the real-life target database of DCASE 2016 demonstrate that transfer learning leads to improved detection performance on average, although successful transfer for events very different from those seen in the source domain could not be verified.
Modeling multimodal cues in a deep learning-based framework for emotion recognition in the wild
TLDR
A fusion network that merges cues from the different modalities into one representation is proposed; it outperforms the challenge baselines, achieving accuracies of 50.39% and 49.92% on the validation and test data, respectively.

References

Showing 1-10 of 73 references
Deep Convolutional Neural Networks and Data Augmentation for Acoustic Event Detection
TLDR
This work introduces a convolutional neural network (CNN) with a large input field for AED that significantly outperforms state-of-the-art methods, including Bag-of-Audio-Words (BoAW) and classical CNNs, achieving a 16% absolute improvement.
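A minimal sketch of waveform-level augmentation in the spirit of this work, which, as we understand it, also mixes sounds of the same class; the specific transforms and parameters here are illustrative assumptions:

import numpy as np

def augment(wave, rng, max_gain_db=6.0):
    """Simple waveform augmentations for AED training: random
    circular time shift and random gain. Illustrative only; the
    paper's exact augmentations may differ."""
    shift = rng.integers(0, len(wave))
    wave = np.roll(wave, shift)
    gain_db = rng.uniform(-max_gain_db, max_gain_db)
    return wave * (10.0 ** (gain_db / 20.0))

def mix_same_class(wave_a, wave_b, rng):
    """Blend two examples of the same class with a random weight."""
    w = rng.uniform(0.0, 1.0)
    return w * wave_a + (1.0 - w) * wave_b

rng = np.random.default_rng(0)
x = augment(np.random.randn(16000), rng)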
Exploiting spectro-temporal locality in deep learning based acoustic event detection
TLDR
Two feature extraction strategies are explored: using multiple-resolution spectrograms simultaneously while analyzing their overall and event-wise influence to combine the results, and using convolutional neural networks (CNNs), a state-of-the-art 2D feature-extraction model that exploits local structure, with log-power spectrogram input for AED.
Improved audio features for large-scale multimedia event detection
TLDR
While the overall finding is that MFCC features perform best, it is found that ANN as well as LSP features provide complementary information at various levels of temporal resolution.
Audio-based multimedia event detection using deep recurrent neural networks
TLDR
This paper introduces longer-range temporal information with deep recurrent neural networks (RNNs) for both stages of multimedia event detection, and observes improvements in both frame-level and clip-level performance compared to SVM and feed-forward neural network baselines.
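A minimal sketch of the general recipe: a recurrent network over per-frame audio features with clip-level pooling. The GRU cell, layer sizes, and mean pooling are assumptions; the paper may use different cells and aggregation:

import torch
import torch.nn as nn

class ClipLevelRNN(nn.Module):
    """Sketch: an RNN aggregates longer-range temporal context over
    per-frame audio features, then a clip-level prediction is pooled
    from the frame outputs. Sizes and pooling are assumptions."""
    def __init__(self, n_feats=40, n_classes=20, hidden=128):
        super().__init__()
        self.rnn = nn.GRU(n_feats, hidden, num_layers=2, batch_first=True)
        self.out = nn.Linear(hidden, n_classes)

    def forward(self, frames):                   # (batch, time, n_feats)
        h, _ = self.rnn(frames)                  # frame-level states
        frame_logits = self.out(h)               # per-frame predictions
        clip_logits = frame_logits.mean(dim=1)   # clip-level pooling
        return frame_logits, clip_logits

model = ClipLevelRNN()
frame_preds, clip_preds = model(torch.randn(4, 500, 40))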
Audio-Based Multimedia Event Detection with DNNs and Sparse Sampling
TLDR
A sparse audio frame-sampling method is proposed that improves event-detection speed and accuracy, showing for the first time the potential of using only a DNN for audio-based multimedia event detection.
Real-world acoustic event detection
Two-Stream Convolutional Networks for Action Recognition in Videos
TLDR
This work proposes a two-stream ConvNet architecture which incorporates spatial and temporal networks and demonstrates that a ConvNet trained on multi-frame dense optical flow is able to achieve very good performance in spite of limited training data.
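A toy sketch of the two-stream idea with score-level fusion; the stand-in ConvNets and input sizes are assumptions, far smaller than the networks in the paper:

import torch
import torch.nn as nn

def small_convnet(in_ch, n_classes):
    """A toy ConvNet standing in for each stream's network."""
    return nn.Sequential(
        nn.Conv2d(in_ch, 32, 3, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        nn.Linear(32, n_classes))

class TwoStream(nn.Module):
    """Minimal two-stream sketch: a spatial stream on an RGB frame
    and a temporal stream on stacked optical-flow fields, fused by
    averaging class scores."""
    def __init__(self, n_classes=101, flow_stack=10):
        super().__init__()
        self.spatial = small_convnet(3, n_classes)                 # RGB frame
        self.temporal = small_convnet(2 * flow_stack, n_classes)   # x/y flow

    def forward(self, rgb, flow):
        return (self.spatial(rgb) + self.temporal(flow)) / 2  # score fusion

model = TwoStream()
scores = model(torch.randn(2, 3, 224, 224), torch.randn(2, 20, 224, 224))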
Beyond short snippets: Deep networks for video classification
TLDR
This work proposes and evaluates several deep neural network architectures to combine image information across a video over longer time periods than previously attempted, and proposes two methods capable of handling full-length videos.
Bag-of-Audio-Words Approach for Multimedia Event Classification
TLDR
Variations of the BoAW method are explored, and results on the NIST 2011 multimedia event detection (MED) dataset are presented.
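A minimal Bag-of-Audio-Words sketch, assuming MFCC-like frame features and a k-means codebook; the feature choice, codebook size, and normalization are illustrative, not the paper's exact setup:

import numpy as np
from sklearn.cluster import KMeans

def boaw_histograms(frame_feats, codebook_size=256, seed=0):
    """Cluster frame-level features (e.g., MFCCs) into a codebook,
    then represent each clip as a normalized histogram of codeword
    assignments. Parameters here are illustrative assumptions."""
    kmeans = KMeans(n_clusters=codebook_size, random_state=seed, n_init=10)
    kmeans.fit(np.vstack(frame_feats))        # learn codebook on all frames
    hists = []
    for clip in frame_feats:                  # one (n_frames, dim) array/clip
        words = kmeans.predict(clip)
        hist = np.bincount(words, minlength=codebook_size).astype(float)
        hists.append(hist / max(hist.sum(), 1.0))
    return np.stack(hists)                    # (n_clips, codebook_size)

# Toy example: 5 clips of random 13-dim "MFCC" frames.
clips = [np.random.randn(100, 13) for _ in range(5)]
X = boaw_histograms(clips, codebook_size=16)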
Very deep multilingual convolutional neural networks for LVCSR
TLDR
A very deep convolutional network architecture with up to 14 weight layers and small 3×3 kernels, inspired by the VGG ImageNet 2014 architecture, is introduced, along with multilingual CNNs with multiple untied layers.
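A toy sketch of the "untied layers" idea as we read it: a shared small-kernel trunk with language-specific upper layers. The depth and sizes are assumptions, nowhere near the 14-weight-layer architecture described:

import torch
import torch.nn as nn

class MultilingualCNN(nn.Module):
    """Sketch of untied multilingual layers: lower convolutional
    layers shared across languages, upper layers and the output kept
    language-specific. Sizes are toy assumptions."""
    def __init__(self, languages, n_feats=40, n_targets=1000):
        super().__init__()
        self.shared = nn.Sequential(   # shared 3x3-kernel trunk
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2))
        self.heads = nn.ModuleDict({   # untied, per-language layers
            lang: nn.Sequential(
                nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                nn.Linear(64, n_targets))
            for lang in languages})

    def forward(self, x, lang):
        return self.heads[lang](self.shared(x))

model = MultilingualCNN(["en", "fr"])
out = model(torch.randn(2, 1, 40, 100), "en")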