Publications
CNN architectures for large-scale audio classification
TLDR
This work uses various CNN architectures to classify the soundtracks of a dataset of 70M training videos with 30,871 video-level labels, and investigates varying the size of both the training set and the label vocabulary. It finds that analogs of the CNNs used in image classification do well on this audio classification task, and that larger training and label sets help up to a point.
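The classification recipe is conventional at heart: a small convolutional stack over log-mel spectrogram patches with a multi-label sigmoid output. A minimal sketch follows, assuming inputs of shape (batch, 1, 96, 64) (96 frames by 64 mel bands) and an illustrative 527-class vocabulary; the architecture and sizes are stand-ins, not the paper's exact models.

```python
# Minimal sketch of a CNN classifier over log-mel spectrogram patches.
# Architecture, input shape, and vocabulary size are illustrative assumptions.
import torch
import torch.nn as nn

class AudioCNN(nn.Module):
    def __init__(self, num_classes: int = 527):  # hypothetical label vocabulary
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),              # global pooling over time/freq
        )
        self.classifier = nn.Linear(128, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.features(x).flatten(1)
        return self.classifier(h)  # logits; apply sigmoid for multi-label scores

model = AudioCNN()
logits = model(torch.randn(8, 1, 96, 64))         # a batch of spectrogram patches
loss = nn.functional.binary_cross_entropy_with_logits(
    logits, torch.randint(0, 2, logits.shape).float())  # multi-label BCE
```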
Unsupervised Learning of Semantic Audio Representations
TLDR
This work considers several class-agnostic semantic constraints that apply to unlabeled non-speech audio and proposes low-dimensional embeddings of the input spectrograms. These embeddings recover 41% and 84% of the performance of their fully-supervised counterparts on downstream query-by-example sound retrieval and sound event classification tasks, respectively.
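For the query-by-example evaluation, the learned embeddings are used directly for nearest-neighbor retrieval. A minimal sketch, with illustrative dimensions and random placeholders standing in for real embeddings:

```python
# Minimal sketch of query-by-example retrieval over learned audio embeddings:
# rank a corpus by cosine similarity to a query embedding.
# Dimensions and data here are illustrative placeholders.
import numpy as np

def rank_by_similarity(query_emb: np.ndarray, corpus_embs: np.ndarray) -> np.ndarray:
    """Return corpus indices sorted from most to least similar to the query."""
    q = query_emb / np.linalg.norm(query_emb)
    c = corpus_embs / np.linalg.norm(corpus_embs, axis=1, keepdims=True)
    return np.argsort(-(c @ q))

corpus = np.random.randn(1000, 128)   # 1000 clips, 128-dim embeddings
query = np.random.randn(128)          # embedding of the query clip
top10 = rank_by_similarity(query, corpus)[:10]
```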
Addressing Missing Labels in Large-Scale Sound Event Recognition Using a Teacher-Student Framework With Loss Masking
TLDR
This work proposes a simple, model-agnostic method based on a teacher-student framework with loss masking: first identify the most critical missing-label candidates, then ignore their contribution during learning. It finds that this simple optimisation of the training label set improves recognition performance without additional computation.
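A minimal sketch of the loss-masking idea, under assumed details (the threshold value and the helper name masked_bce are illustrative): labels annotated negative that the teacher nonetheless scores highly are treated as likely missing labels and excluded from the student's loss.

```python
# Minimal sketch of teacher-student loss masking for missing labels.
# Threshold and function name are assumptions, not the paper's exact settings.
import torch
import torch.nn.functional as F

def masked_bce(student_logits, labels, teacher_probs, threshold=0.9):
    # Missing-label candidates: annotated 0 but scored high by the teacher.
    suspect = (labels == 0) & (teacher_probs > threshold)
    per_label = F.binary_cross_entropy_with_logits(
        student_logits, labels.float(), reduction="none")
    mask = (~suspect).float()                 # drop suspects from the loss
    return (per_label * mask).sum() / mask.sum().clamp(min=1.0)

logits = torch.randn(4, 10)                   # student predictions
labels = torch.randint(0, 2, (4, 10))         # possibly incomplete annotations
teacher = torch.sigmoid(torch.randn(4, 10))   # teacher posteriors
loss = masked_bce(logits, labels, teacher)
```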
Coincidence, Categorization, and Consolidation: Learning to Recognize Sounds with Minimal Supervision
TLDR
This work presents a learning framework for sound representation and recognition that combines three components: a self-supervised objective based on a general notion of unimodal and cross-modal coincidence, a clustering objective that imposes categorical structure on experience, and a cluster-based active learning procedure that solicits targeted weak supervision to consolidate clusters into relevant semantic classes.
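The coincidence objective can be sketched as a binary prediction problem: decide whether two embedded observations came from the same moment or clip. The encoder, scorer, and pairing scheme below are illustrative assumptions, not the paper's exact model.

```python
# Minimal sketch of a coincidence objective: classify whether two views
# co-occurred. Encoder, scorer, and synthetic data are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Linear(64, 32)                 # stand-in for an audio encoder
scorer = nn.Bilinear(32, 32, 1)             # coincidence score for a pair

a = torch.randn(16, 64)                     # view 1 of 16 clips
b_pos = a + 0.1 * torch.randn(16, 64)       # coincident view (same clip)
b_neg = b_pos[torch.randperm(16)]           # non-coincident view (shuffled)

za, zp, zn = encoder(a), encoder(b_pos), encoder(b_neg)
logits = torch.cat([scorer(za, zp), scorer(za, zn)]).squeeze(-1)
targets = torch.cat([torch.ones(16), torch.zeros(16)])
loss = F.binary_cross_entropy_with_logits(logits, targets)
```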
The Benefit of Temporally-Strong Labels in Audio Event Classification
TLDR
It is shown that fine-tuning with a mix of weakly- and strongly-labeled data can substantially improve classifier performance, even when evaluated using only the original weak labels.
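One way to combine the two label granularities is a mixed loss: clip-level BCE on weakly labeled clips (frame logits pooled to one prediction) plus frame-level BCE on temporally-strong clips. The pooling choice and mixing weight below are assumptions, not the paper's reported configuration.

```python
# Minimal sketch of fine-tuning with mixed weak (clip-level) and strong
# (frame-level) supervision. Mean pooling and strong_weight are assumptions.
import torch
import torch.nn.functional as F

def mixed_loss(frame_logits_weak, clip_labels, frame_logits_strong,
               frame_labels, strong_weight=1.0):
    # Weak clips: pool frame logits to a single clip-level prediction.
    clip_logits = frame_logits_weak.mean(dim=1)            # (batch, classes)
    weak = F.binary_cross_entropy_with_logits(clip_logits, clip_labels)
    # Strong clips: supervise every frame directly.
    strong = F.binary_cross_entropy_with_logits(frame_logits_strong, frame_labels)
    return weak + strong_weight * strong

weak_logits = torch.randn(4, 100, 10)          # 4 clips, 100 frames, 10 classes
clip_labels = torch.randint(0, 2, (4, 10)).float()
strong_logits = torch.randn(4, 100, 10)
frame_labels = torch.randint(0, 2, (4, 100, 10)).float()
loss = mixed_loss(weak_logits, clip_labels, strong_logits, frame_labels)
```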
Accelerating Inference: towards a full Language, Compiler and Hardware stack
TLDR
Dimple allows the user to specify probabilistic models as graphical models, Bayesian networks, or factor graphs, and performs inference on the model by automatically deriving an inference engine from a variety of algorithms.
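To fix intuitions about what such a factor-graph specification computes, here is a framework-free illustration (this is NOT Dimple's actual API): two binary variables with unary and pairwise factors, and a marginal obtained by brute-force enumeration, the calculation any inference engine must reproduce.

```python
# Framework-free sketch of factor-graph inference by enumeration.
# Not Dimple's API; factor values are arbitrary illustrations.
import numpy as np
from itertools import product

unary_x = np.array([0.7, 0.3])              # factor over x
unary_y = np.array([0.4, 0.6])              # factor over y
pairwise = np.array([[0.9, 0.1],            # factor over (x, y)
                     [0.2, 0.8]])

# Unnormalized joint: p(x, y) proportional to unary_x[x] * unary_y[y] * pairwise[x, y]
joint = np.zeros((2, 2))
for x, y in product(range(2), range(2)):
    joint[x, y] = unary_x[x] * unary_y[y] * pairwise[x, y]

marginal_x = joint.sum(axis=1)
marginal_x /= marginal_x.sum()              # inference result: p(x)
```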
Into the Wild with AudioScope: Unsupervised Audio-Visual Separation of On-Screen Sounds
TLDR
This work presents AudioScope, a novel audio-visual sound separation framework that can be trained without supervision to isolate on-screen sound sources from real in-the-wild videos, using a dataset of video clips extracted from open-domain YFCC100m video data.
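AudioScope builds on unsupervised mixture invariant training (MixIT) from the same line of work. A minimal sketch of that objective for two mixtures follows; the function, shapes, and exhaustive assignment search are illustrative, and real implementations use signal-level separation losses rather than plain MSE.

```python
# Minimal sketch of mixture invariant training (MixIT): the model separates
# the sum of two mixtures into sources, and the loss takes the best binary
# assignment of sources back to the two original mixtures. All details assumed.
import itertools
import torch

def mixit_loss(sources, mix1, mix2):
    # sources: (num_sources, time); each source is assigned to mix1 or mix2.
    best = None
    n = sources.shape[0]
    for bits in itertools.product([0, 1], repeat=n):
        a = torch.tensor(bits, dtype=sources.dtype).unsqueeze(1)
        est1 = (sources * (1 - a)).sum(0)     # sources assigned to mixture 1
        est2 = (sources * a).sum(0)           # sources assigned to mixture 2
        loss = ((est1 - mix1) ** 2).mean() + ((est2 - mix2) ** 2).mean()
        best = loss if best is None else torch.minimum(best, loss)
    return best

sources = torch.randn(4, 16000)               # model output: 4 estimated sources
mix1, mix2 = torch.randn(16000), torch.randn(16000)
loss = mixit_loss(sources, mix1, mix2)
```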
Towards Learning Semantic Audio Representations from Unlabeled Data
TLDR
This work considers several class-agnostic semantic constraints inherent to non-speech audio and applies them to sample training data for triplet-loss embedding models from a large unlabeled dataset of YouTube soundtracks, learning semantically structured audio representations.
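One such constraint is temporal proximity: segments close in time within the same soundtrack serve as anchor/positive pairs, with segments from other clips as negatives, trained under a standard triplet margin loss. The encoder and synthetic data below are illustrative assumptions.

```python
# Minimal sketch of triplet-loss embedding training with a temporal-proximity
# sampling constraint. Encoder, shapes, and data are assumptions.
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 32))
triplet = nn.TripletMarginLoss(margin=1.0)

anchor = torch.randn(16, 128)                    # a spectrogram frame (flattened)
positive = anchor + 0.05 * torch.randn(16, 128)  # a nearby frame, same clip
negative = torch.randn(16, 128)                  # a frame from a different clip

loss = triplet(encoder(anchor), encoder(positive), encoder(negative))
```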
Self-Supervised Learning from Automatically Separated Sound Scenes
TLDR
This paper explores the use of unsupervised automatic sound separation to decompose unlabeled sound scenes into multiple semantically-linked views for self-supervised contrastive learning. It finds that learning to associate input mixtures with their automatically separated outputs yields stronger representations than past approaches that use the mixtures alone.
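The central association step can be sketched as an InfoNCE-style contrastive loss: embed each mixture and the sources a separation model produced from it, then train each mixture embedding to be closer to its own separated sources than to sources from other mixtures. Every component below (encoder, temperature, placeholder data) is an assumption.

```python
# Minimal sketch of contrastive learning between mixtures and their
# automatically separated outputs (InfoNCE-style). All details assumed.
import torch
import torch.nn as nn
import torch.nn.functional as F

embed = nn.Linear(128, 32)                         # stand-in audio encoder

mixtures = torch.randn(16, 128)                    # 16 unlabeled sound scenes
separated = mixtures + 0.1 * torch.randn(16, 128)  # one separated source each

zm = F.normalize(embed(mixtures), dim=1)
zs = F.normalize(embed(separated), dim=1)
logits = zm @ zs.t() / 0.1                         # similarities, temperature 0.1
loss = F.cross_entropy(logits, torch.arange(16))   # match each mixture to its own source
```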