Knowledge Transfer from Weakly Labeled Audio Using Convolutional Neural Network for Sound Events and Scenes

@article{Kumar2018KnowledgeTF,
  title={Knowledge Transfer from Weakly Labeled Audio Using Convolutional Neural Network for Sound Events and Scenes},
  author={Anurag Kumar and Maksim Khadkevich and Christian F{\"u}gen},
  journal={2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year={2018},
  pages={326-330}
}
In this work we propose approaches to effectively transfer knowledge from weakly labeled web audio data. We first describe a convolutional neural network (CNN) based framework for sound event detection and classification using weakly labeled audio data. Our model trains efficiently on audio recordings of variable length; hence, it is well suited for transfer learning. We then propose methods to learn representations using this model which can be effectively used for solving the target task. We study… 
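A minimal sketch of the kind of weak-label training setup the abstract describes: a CNN maps a variable-length log-mel spectrogram to segment-level scores, and temporal max pooling reduces them to one clip-level prediction trained against the recording-level label. All layer sizes and names here are illustrative assumptions, not the paper's exact architecture.

```python
# Hypothetical weak-label CNN: segment-level scores are max-pooled over time
# so a clip-level loss can be computed from the weak (recording-level) label.
import torch
import torch.nn as nn

class WeakLabelCNN(nn.Module):
    def __init__(self, n_mels=128, n_classes=527):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.MaxPool2d((2, 2)),
            nn.Conv2d(32, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.MaxPool2d((2, 2)),
        )
        # Collapse the mel axis, keep time: segment-level class scores.
        self.classifier = nn.Conv2d(64, n_classes, kernel_size=(n_mels // 4, 1))

    def forward(self, x):                      # x: (batch, 1, n_mels, time)
        h = self.features(x)
        seg = self.classifier(h).squeeze(2)    # (batch, n_classes, time')
        clip = seg.max(dim=-1).values          # MIL-style weak-label pooling
        return torch.sigmoid(clip), torch.sigmoid(seg)

model = WeakLabelCNN()
x = torch.randn(4, 1, 128, 400)               # any time length works
clip_prob, seg_prob = model(x)
loss = nn.functional.binary_cross_entropy(clip_prob, torch.rand(4, 527).round())
```

Because the final pooling is over time, the same network accepts recordings of any length, which is what makes this style of model convenient for transfer.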

Citations

A Joint Framework for Audio Tagging and Weakly Supervised Acoustic Event Detection Using DenseNet with Global Average Pooling
TLDR
A network architecture designed mainly for audio tagging that can also be used for weakly supervised acoustic event detection (AED); it consists of a modified DenseNet as the feature extractor and a global average pooling (GAP) layer to predict frame-level labels at inference time.
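A small sketch of the GAP idea described above, assuming a backbone that already produces frame-level logits: averaging over time yields the clip-level tagging output, while the pre-pooling frame map serves event detection at inference.

```python
# GAP head: train on clip-level tags by averaging frame-level logits over
# time; at inference, read the frame-level map for event detection.
import torch

def gap_head(frame_logits):                 # (batch, n_classes, n_frames)
    return frame_logits.mean(dim=-1)        # global average pooling over time

frame_logits = torch.randn(2, 10, 240)      # e.g. from a DenseNet backbone
clip_logits = gap_head(frame_logits)        # used for the tagging loss
frame_probs = frame_logits.sigmoid()        # used at inference for AED
```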
A Closer Look at Weak Label Learning for Audio Events
TLDR
This work describes a CNN based approach for weakly supervised training of audio events, characterizes important aspects that naturally arise in weakly supervised learning of sound events, and shows how these aspects of weak labels affect the generalization of models.
Deep Convolutional Neural Network with Structured Prediction for Weakly Supervised Audio Event Detection
TLDR
A deep convolutional neural network model called DSNet, based on densely connected convolutional networks (DenseNets) and squeeze-and-excitation networks (SENets), is introduced for weakly supervised training of AED; it alleviates the vanishing-gradient problem, strengthens feature propagation, and models interdependencies between channels.
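For concreteness, a standard squeeze-and-excitation block as defined in the SENet paper; this is a generic PyTorch rendering, not DSNet's exact configuration.

```python
# Squeeze-and-excitation: global-average "squeeze" per channel, then a small
# bottleneck MLP produces per-channel gates that reweight the feature map.
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x):                  # x: (batch, channels, H, W)
        s = x.mean(dim=(2, 3))             # squeeze: global average pooling
        w = self.fc(s).unsqueeze(-1).unsqueeze(-1)
        return x * w                       # excite: channel-wise reweighting

out = SEBlock(64)(torch.randn(2, 64, 32, 100))
```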
Cure Dataset: Ladder Networks for Audio Event Classification
TLDR
The CURE dataset is established, containing a curated set of audio events most relevant for people with hearing loss; experiments establish the superiority of the Ladder network over ELM and SVM classifiers in terms of robustness and classification accuracy.
Weakly Supervised Representation Learning for Audio-Visual Scene Analysis
TLDR
This work develops methods that identify events and localize corresponding AV cues in unconstrained videos using weak labels, and demonstrates the framework's ability to separate out the audio source of interest through a novel use of nonnegative matrix factorization.
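An illustrative soft-mask separation with NMF of the general kind mentioned above; the component selection here is a made-up placeholder, since in the paper it would be guided by the learned event representations.

```python
# Decompose a magnitude spectrogram with NMF, keep the components attributed
# to the event of interest, and build a soft mask to extract that source.
import numpy as np
from sklearn.decomposition import NMF

S = np.abs(np.random.rand(513, 200))        # stand-in magnitude spectrogram
nmf = NMF(n_components=8, init="nndsvda", max_iter=400)
W = nmf.fit_transform(S)                    # spectral templates (513, 8)
H = nmf.components_                         # activations (8, 200)

keep = [0, 3]                               # hypothetical event-related components
S_event = W[:, keep] @ H[keep, :]
mask = S_event / (W @ H + 1e-8)             # soft mask for the chosen source
S_separated = mask * S
```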
Do sound event representations generalize to other audio tasks? A case study in audio transfer learning
TLDR
This paper investigates the transfer learning capacity of audio representations obtained from neural networks trained on a large-scale sound event detection dataset, and evaluates these representations across a wide range of other audio tasks.
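A minimal sketch of the linear-probe style transfer evaluation such a study typically uses: freeze the pretrained sound-event network, treat an intermediate layer as an embedding, and fit a lightweight classifier on the target task. The backbone below is a stand-in.

```python
# Frozen-feature transfer: embeddings from a (stand-in) pretrained network
# feed a simple logistic-regression classifier for the target task.
import numpy as np
import torch
import torch.nn as nn
from sklearn.linear_model import LogisticRegression

pretrained = nn.Sequential(nn.Flatten(), nn.Linear(128 * 100, 256), nn.ReLU())

@torch.no_grad()
def embed(x):                               # x: (n, 1, 128, 100) spectrograms
    pretrained.eval()
    return pretrained(x).numpy()

X = embed(torch.randn(20, 1, 128, 100))     # frozen features
y = np.arange(20) % 2                       # target-task labels (dummy)
clf = LogisticRegression(max_iter=1000).fit(X, y)
```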
Joint Training of Guided Learning and Mean Teacher Models for Sound Event Detection
TLDR
This paper's proposed model structure includes a feature-level front-end based on convolutional neural networks (CNNs), followed by both embedding-level and instance-level back-end attention modules; a set of adaptive median windows for individual sound events is used to smooth the frame-level predictions in post-processing.
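The adaptive-median-window post-processing reduces to per-class median filtering of binarized frame predictions; a sketch follows, with window lengths invented for illustration.

```python
# Per-class median smoothing: short windows suit impulsive events, long
# windows suit sustained ones. Window sizes below are illustrative.
import numpy as np
from scipy.signal import medfilt

def smooth(frame_probs, windows, threshold=0.5):
    """frame_probs: (n_classes, n_frames); windows: one odd int per class."""
    binary = (frame_probs > threshold).astype(float)
    return np.stack([medfilt(binary[c], windows[c])
                     for c in range(len(windows))])

probs = np.random.rand(3, 500)
smoothed = smooth(probs, windows=[5, 27, 81])   # per-class adaptive windows
```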
Learning Sound Events From Webly Labeled Data
TLDR
This work introduces webly labeled learning for sound events which aims to remove human supervision altogether from the learning process, and develops a method of obtaining labeled audio data from the web, in which no manual labeling is involved.
Learning and Fusing Multimodal Deep Features for Acoustic Scene Categorization
TLDR
A novel acoustic scene classification system based on multimodal deep feature fusion is proposed, where three CNNs have been presented to perform 1D raw waveform modeling, 2D time-frequency image modeling, and 3D spatial-temporal dynamics modeling, respectively.
Self-supervised Attention Model for Weakly Labeled Audio Event Classification
TLDR
A novel weakly labeled audio event classification approach based on a self-supervised attention model that achieves 8.8% and 17.6% relative mean average precision improvements over the current state-of-the-art systems for SL-DCASE-17 and balanced AudioSet.
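Generic attention pooling over frames, the core mechanism behind attention-based weak-label models like the one above (the exact formulation in the paper may differ):

```python
# Attention pooling: one branch scores each frame per class, another learns
# how much each frame should contribute to the clip-level prediction.
import torch
import torch.nn as nn

class AttentionPool(nn.Module):
    def __init__(self, dim, n_classes):
        super().__init__()
        self.cla = nn.Linear(dim, n_classes)    # per-frame class scores
        self.att = nn.Linear(dim, n_classes)    # per-frame attention logits

    def forward(self, h):                       # h: (batch, frames, dim)
        w = torch.softmax(self.att(h), dim=1)   # normalize over frames
        return (w * torch.sigmoid(self.cla(h))).sum(dim=1)  # clip-level probs

pool = AttentionPool(dim=256, n_classes=10)
clip_probs = pool(torch.randn(4, 120, 256))
```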

References

Deep CNN Framework for Audio Event Recognition using Weakly Labeled Web Data
TLDR
A robust and efficient deep convolutional neural network (CNN) based framework to learn audio event recognizers from weakly labeled data; it can train on and analyze recordings of variable length efficiently and outperforms a network trained with strongly labeled web data by a considerable margin.
Attention and Localization Based on a Deep Convolutional Recurrent Model for Weakly Supervised Audio Tagging
TLDR
A weakly supervised method that not only predicts the tags but also indicates the temporal locations of the acoustic events that occur; the attention scheme is found to be effective in identifying the important frames while ignoring the unrelated ones.
Audio Event Detection using Weakly Labeled Data
TLDR
It is shown that audio event detection using weak labels can be formulated as a multiple instance learning (MIL) problem, and two frameworks for solving it are suggested: one based on support vector machines and the other on neural networks.
Deep Convolutional Neural Networks and Data Augmentation for Environmental Sound Classification
TLDR
It is shown that the improved performance stems from the combination of a deep, high-capacity model and an augmented training set: this combination outperforms both the proposed CNN without augmentation and a “shallow” dictionary learning model with augmentation.
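Typical waveform augmentations of the kind that work applies (time stretching, pitch shifting, added noise); parameter values are illustrative:

```python
# Common environmental-sound augmentations on the raw waveform.
import numpy as np
import librosa

y = np.random.randn(22050 * 4).astype(np.float32)  # stand-in 4 s clip
sr = 22050

stretched = librosa.effects.time_stretch(y, rate=1.2)       # faster playback
shifted = librosa.effects.pitch_shift(y, sr=sr, n_steps=2)  # up two semitones
noisy = y + 0.005 * np.random.randn(len(y)).astype(np.float32)
```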
CNN architectures for large-scale audio classification
TLDR
This work uses various CNN architectures to classify the soundtracks of a dataset of 70M training videos with 30,871 video-level labels, and investigates varying the size of both the training set and the label vocabulary, finding that analogs of the CNNs used in image classification do well on this audio classification task, and that larger training and label sets help up to a point.
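For reference, how audio is commonly turned into the image-like input such CNNs consume, via a log-scaled mel spectrogram (these librosa parameters are conventional choices, not necessarily those of the cited work):

```python
# Log-mel "image" computation: the 2-D array feeds image-style CNNs directly.
import numpy as np
import librosa

y = np.random.randn(16000 * 10).astype(np.float32)  # stand-in 10 s clip
sr = 16000
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024,
                                     hop_length=512, n_mels=64)
log_mel = librosa.power_to_db(mel)          # (64, frames) input for the CNN
```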
Transfer learning of weakly labelled audio
  • Aleksandr Diment, T. Virtanen
  • 2017 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)
TLDR
This work proposes to solve weakly labelled sound event tagging with small amounts of training data by transferring abstract knowledge about the nature of audio from another tagging task: a recurrent neural network, or parts of it, is pre-trained on a tagging task for which abundant and diverse training data are available.
SoundNet: Learning Sound Representations from Unlabeled Video
TLDR
This work proposes a student-teacher training procedure which transfers discriminative visual knowledge from well established visual recognition models into the sound modality using unlabeled video as a bridge, and suggests some high-level semantics automatically emerge in the sound network, even though it is trained without ground truth labels.
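SoundNet's student-teacher transfer reduces to a distillation loss: the sound student matches the class posteriors a visual teacher assigns to frames of the same video. A sketch with stand-in tensors:

```python
# Distillation across modalities: KL divergence between the visual teacher's
# posteriors and the sound student's predicted distribution.
import torch
import torch.nn as nn

kl = nn.KLDivLoss(reduction="batchmean")

teacher_logits = torch.randn(8, 1000)       # visual net on video frames
student_logits = torch.randn(8, 1000, requires_grad=True)  # sound net on audio

loss = kl(torch.log_softmax(student_logits, dim=1),
          torch.softmax(teacher_logits, dim=1))
loss.backward()                             # gradients flow only to the student
```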
Learning and Transferring Mid-level Image Representations Using Convolutional Neural Networks
TLDR
This work designs a method to reuse layers trained on the ImageNet dataset to compute mid-level image representation for images in the PASCAL VOC dataset, and shows that despite differences in image statistics and tasks in the two datasets, the transferred representation leads to significantly improved results for object and action classification.
Audio Set: An ontology and human-labeled dataset for audio events
TLDR
The creation of Audio Set is described, a large-scale dataset of manually-annotated audio events that endeavors to bridge the gap in data availability between image and audio research and substantially stimulate the development of high-performance audio event recognizers.
Learning environmental sounds with end-to-end convolutional neural network
  • Yuji Tokozume, T. Harada
  • 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
TLDR
This paper proposes a novel end-to-end ESC system using a convolutional neural network (CNN) and achieves a 6.5% improvement in classification accuracy over the state-of-the-art log-mel CNN with static and delta log-mel features, simply by combining the proposed system with the log-mel CNN.
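A sketch of a raw-waveform 1D-CNN front-end plus the simple score-level fusion the summary mentions; both models below are stand-ins, not the paper's architecture:

```python
# 1-D convolutions operate directly on audio samples; fusion averages the
# class probabilities of the raw-waveform model and a log-mel CNN.
import torch
import torch.nn as nn

raw_cnn = nn.Sequential(
    nn.Conv1d(1, 16, kernel_size=64, stride=2), nn.ReLU(),
    nn.Conv1d(16, 32, kernel_size=16, stride=2), nn.ReLU(),
    nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(32, 50),
)

wave = torch.randn(4, 1, 24000)              # ~1.5 s at 16 kHz
p_raw = raw_cnn(wave).softmax(dim=1)
p_logmel = torch.rand(4, 50).softmax(dim=1)  # stand-in log-mel CNN output
p_fused = (p_raw + p_logmel) / 2             # late fusion by averaging
```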