Audio Set: An ontology and human-labeled dataset for audio events
TLDR
The creation of Audio Set, a large-scale dataset of manually annotated audio events, is described; the dataset aims to bridge the gap in data availability between image and audio research and to substantially stimulate the development of high-performance audio event recognizers.
The Million Song Dataset
TLDR
The Million Song Dataset, a freely available collection of audio features and metadata for a million contemporary popular music tracks, is introduced; positive results on year prediction are shown, and the future development of the dataset is discussed.
CNN architectures for large-scale audio classification
TLDR
This work uses various CNN architectures to classify the soundtracks of a dataset of 70M training videos with 30,871 video-level labels, and investigates varying the size of both the training set and the label vocabulary, finding that analogs of the CNNs used in image classification do well on the authors' audio classification task and that larger training and label sets help up to a point.
The ICSI Meeting Corpus
TLDR
A corpus of data from natural meetings that occurred at the International Computer Science Institute in Berkeley, California over the last three years is collected, which supports work in automatic speech recognition, noise robustness, dialog modeling, prosody, rich transcription, information retrieval, and more.
librosa: Audio and Music Signal Analysis in Python
TLDR
A brief overview of the librosa library's functionality is provided, along with explanations of the design goals, software development practices, and notational conventions.
Tandem connectionist feature extraction for conventional HMM systems
TLDR
A large improvement in word recognition performance is shown by combining neural-net discriminative feature processing with Gaussian-mixture distribution modeling.
Consumer video understanding: a benchmark database and an evaluation of human and machine performance
TLDR
A new database, CCV, containing 9,317 web videos over 20 semantic categories, including events like "baseball" and "parade", scenes like "beach", and objects like "cat", is described and released. The evaluation finds that humans are much better at understanding categories of nonrigid objects such as "cat", while current automatic techniques are relatively close to humans in recognizing categories that have distinctive background scenes or audio patterns.
Prediction-driven computational auditory scene analysis
TLDR
A blackboard-based implementation of the 'prediction-driven' approach is described which analyzes dense, ambient sound examples into a vocabulary of noise clouds, transient clicks, and a correlogram-based representation of wide-band periodic energy called the weft.
Identifying 'Cover Songs' with Chroma Features and Dynamic Programming Beat Tracking
TLDR
A system is described that attempts to identify cover-song relationships between music audio recordings; it achieved the best performance on an independent international evaluation, with a mean reciprocal ranking of 0.49 for true cover versions among top-10 returns.
Model-Based Expectation-Maximization Source Separation and Localization
TLDR
This paper describes a model-based expectation-maximization source separation and localization system for separating and localizing multiple sound sources from an underdetermined reverberant two-channel recording, and creates probabilistic spectrogram masks that can be used for source separation.