Learning Sound Events From Webly Labeled Data

Anurag Kumar, Ankit Shah, Alexander Hauptmann, Bhiksha Raj
In the last couple of years, weakly labeled learning has emerged as an exciting approach to audio event detection. In this work, we introduce webly labeled learning for sound events, which aims to remove human supervision from the learning process altogether. We first develop a method for obtaining labeled (albeit noisy) audio data from the web that involves no manual labeling. We then describe methods to learn efficiently from these webly labeled audio recordings. In our proposed…


Audio tagging with noisy labels and minimal supervision
This paper presents the task setup, the FSDKaggle2019 dataset prepared for this scientific evaluation, and a baseline system consisting of a convolutional neural network.
Audio Tagging by Cross Filtering Noisy Labels
This article presents a novel framework, named CrossFilter, to combat the noisy-label problem in audio tagging; it achieves state-of-the-art performance and even surpasses ensemble models on the FSDKaggle2018 dataset.
A Sequential Self Teaching Approach for Improving Generalization in Sound Event Recognition
Learning sounds in adverse situations, such as from weakly and/or noisily labeled data, is harder, and a single stage of learning is not sufficient there; a sequential, stage-wise learning process is therefore proposed that improves the generalization capabilities of a given modeling system.
Model-Agnostic Approaches To Handling Noisy Labels When Training Sound Event Classifiers
This work evaluates simple and efficient model-agnostic approaches to handling noisy labels when training sound event classifiers, namely label smoothing regularization, mixup, and noise-robust loss functions, which can be easily incorporated into existing deep learning pipelines without the need for network modifications or extra resources.
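Two of the model-agnostic approaches named above, label smoothing and mixup, are simple enough to sketch directly. The following is a minimal numpy illustration of the standard definitions of both techniques, not the paper's implementation; function names and the default hyperparameters are illustrative.

```python
import numpy as np

def smooth_labels(y, eps=0.1):
    """Label smoothing: blend a one-hot target toward the uniform
    distribution over k classes, softening overconfident (possibly
    wrong) labels."""
    k = y.shape[-1]
    return y * (1.0 - eps) + eps / k

def mixup(x1, y1, x2, y2, alpha=0.2, rng=None):
    """Mixup: convex-combine a pair of inputs and their targets with a
    Beta(alpha, alpha)-distributed mixing coefficient."""
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2
```

Both transforms act only on inputs and targets, which is why they drop into an existing pipeline without any network modification.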
Enhanced Audio Tagging via Multi- to Single-Modal Teacher-Student Mutual Learning
This work presents a novel visual-assisted teacher-student mutual learning framework for robust sound event detection from audio recordings that takes advantage of joint audiovisual analysis in training while maximizing the feasibility of the model in its use cases.
Small-Vote Sample Selection for Label-Noise Learning
This paper proposes a novel yet simple sample selection method, which mainly consists of a Hierarchical Voting Scheme (HVS) and an Adaptive Clean data rate Estimation Strategy (ACES), to accurately identify clean samples and noisy-labeled samples for robust learning.
ARCA23K: An audio dataset for investigating open-set label noise
It is shown that the majority of labelling errors in ARCA23K are due to out-of-vocabulary audio clips, and this type of label noise is referred to as open-set label noise.
Addressing Missing Labels in Large-Scale Sound Event Recognition Using a Teacher-Student Framework With Loss Masking
This work proposes a simple and model-agnostic method based on a teacher-student framework with loss masking to first identify the most critical missing label candidates, and then ignore their contribution during the learning process, finding that a simple optimisation of the training label set improves recognition performance without additional computation.
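The loss-masking idea described above, ignoring the contribution of suspected missing labels during training, amounts to zeroing out those terms in the loss. Below is a minimal numpy sketch of a masked binary cross-entropy under that assumption; it is not the paper's code, and the function name is illustrative.

```python
import numpy as np

def masked_bce(probs, targets, mask, eps=1e-7):
    """Binary cross-entropy in which entries with mask == 0 (suspected
    missing labels) contribute nothing; the loss is averaged over the
    remaining (trusted) entries."""
    p = np.clip(probs, eps, 1 - eps)
    loss = -(targets * np.log(p) + (1 - targets) * np.log(1 - p))
    return float((loss * mask).sum() / max(mask.sum(), 1))
```

Masking a confidently predicted "negative" that is actually a missing positive removes a large spurious penalty, which is the effect the teacher-student framework exploits.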
SeCoST: Sequential Co-Supervision for Large Scale Weakly Labeled Audio Event Detection
  • Anurag Kumar, V. Ithapu
  • Computer Science
    ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2020
A new framework for designing learning models with weak supervision by bridging ideas from sequential learning and knowledge distillation is proposed, referred to as SeCoST (pronounced Sequest) — Sequential Co-supervision for training generations of Students.
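One way to picture the co-supervision of student generations described above is as target construction: each student is trained on a blend of the original weak labels and a teacher's soft predictions. The sketch below is a guess at that blending step in numpy; the function name and the fixed weighting `beta` are assumptions, not the paper's formulation.

```python
import numpy as np

def cosupervision_targets(weak_labels, teacher_probs, beta=0.5):
    """Blend the (noisy) weak labels with a teacher model's soft
    predictions to form targets for the next-generation student.
    beta = 1 trusts only the weak labels; beta = 0 trusts only the
    teacher (hypothetical fixed weighting for illustration)."""
    return beta * np.asarray(weak_labels) + (1 - beta) * np.asarray(teacher_probs)
```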


Knowledge Transfer from Weakly Labeled Audio Using Convolutional Neural Network for Sound Events and Scenes
This work describes a convolutional neural network (CNN) based framework for sound event detection and classification using weakly labeled audio data and proposes methods to learn representations using this model which can be effectively used for solving the target task.
Learning to Detect Concepts from Webly-Labeled Video Data
This paper presents compelling insights on the latent non-convex robust loss that is being minimized on the noisy data and proposes two novel techniques that not only enable WELL to be applied to big data but also lead to more accurate results.
Audio Event Detection using Weakly Labeled Data
It is shown that audio event detection using weak labels can be formulated as a multiple instance learning (MIL) problem, and two frameworks for solving it are suggested, one based on support vector machines and the other on neural networks.
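In the MIL formulation, a recording is a "bag" of segment-level "instances", and only the bag carries a label. A common way to score the bag from instance predictions, shown here as a minimal numpy sketch rather than either of the paper's two frameworks, is max pooling: the bag is positive if its most confident instance is.

```python
import numpy as np

def bag_probability(instance_probs):
    """MIL with max pooling: the probability that a bag (recording)
    contains the event is the maximum over its instance (segment)
    probabilities."""
    return np.max(np.asarray(instance_probs), axis=-1)
```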
Learning to Recognize Transient Sound Events using Attentional Supervision
This paper presents an attempt to learn a neural network model that recognizes more than 500 different sound events from the audio part of user-generated videos (UGV), establishing a new state-of-the-art for the DCASE17 and AudioSet datasets.
Adaptive Pooling Operators for Weakly Labeled Sound Event Detection
This paper treats SED as a multiple instance learning (MIL) problem, where training labels are static over a short excerpt, indicating the presence or absence of sound sources but not their temporal locality, and develops a family of adaptive pooling operators, referred to as autopool, which smoothly interpolate between common pooling operators and automatically adapt to the characteristics of the sound sources in question.
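The interpolation between pooling operators can be sketched concretely. Below is a minimal numpy version of a softmax-weighted pooling with a scalar sharpness parameter: at 0 it reduces to mean pooling, and as the parameter grows it approaches max pooling. This follows the general autopool idea but is an illustrative simplification, not the paper's implementation (there, the parameter is learned per class).

```python
import numpy as np

def autopool(p, alpha):
    """Softmax-weighted pooling of instance probabilities p.
    alpha = 0 gives mean pooling; alpha -> infinity approaches max
    pooling; intermediate values interpolate smoothly."""
    p = np.asarray(p, dtype=float)
    w = np.exp(alpha * p)
    w /= w.sum(axis=-1, keepdims=True)
    return (w * p).sum(axis=-1)
```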
Audio event and scene recognition: A unified approach using strongly and weakly labeled data
  • B. Raj, Anurag Kumar
  • Computer Science
    2017 International Joint Conference on Neural Networks (IJCNN)
  • 2017
The main method is based on manifold regularization on graphs; it is shown that the unified learning can be formulated as a constrained optimization problem which can be solved by the iterative concave-convex procedure (CCCP).
Audio Set: An ontology and human-labeled dataset for audio events
The creation of Audio Set is described, a large-scale dataset of manually-annotated audio events that endeavors to bridge the gap in data availability between image and audio research and substantially stimulate the development of high-performance audio event recognizers.
Co-sampling: Training Robust Networks for Extremely Noisy Supervision
Free of noise-transition-matrix estimation, a simple but robust learning paradigm called "Co-sampling" is presented and demonstrated to train deep networks robustly under extremely noisy labels.
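A core ingredient of this family of methods is small-loss sample selection: under label noise, samples on which the network incurs small loss are more likely to be correctly labeled, so each update keeps only a fraction of the lowest-loss samples. A minimal single-network numpy sketch of that selection step follows; the full method coordinates two networks, which is omitted here, and the function name is illustrative.

```python
import numpy as np

def select_small_loss(losses, keep_rate):
    """Return indices of the keep_rate fraction of samples with the
    smallest loss; under label noise these are the samples most likely
    to be correctly labeled."""
    losses = np.asarray(losses)
    k = max(1, int(round(keep_rate * len(losses))))
    return np.argsort(losses)[:k]
```

In practice `keep_rate` is scheduled to shrink over training as the estimated noise rate takes effect.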
Attention and Localization Based on a Deep Convolutional Recurrent Model for Weakly Supervised Audio Tagging
A weakly supervised method is proposed that not only predicts the tags but also indicates the temporal locations of the occurring acoustic events; the attention scheme is found to be effective in identifying the important frames while ignoring the unrelated ones.
CNN architectures for large-scale audio classification
This work uses various CNN architectures to classify the soundtracks of a dataset of 70M training videos with 30,871 video-level labels, and investigates varying the size of both the training set and the label vocabulary, finding that analogs of the CNNs used in image classification do well on this audio classification task, and that larger training and label sets help up to a point.