Corpus ID: 240419721

AVASpeech-SMAD: A Strongly Labelled Speech and Music Activity Detection Dataset with Label Co-Occurrence

Yun-Ning Hung, Karn N. Watcharasupat, Chih-Wei Wu, Iroro Orife, Kelian Li, Pavan Seshadri, Junyoung Lee
We propose a dataset, AVASpeech-SMAD, to assist speech and music activity detection research. With frame-level music labels, the proposed dataset extends the existing AVASpeech dataset, which originally consists of 45 hours of audio and speech activity labels. To the best of our knowledge, the proposed AVASpeech-SMAD is the first open-source dataset that features strong polyphonic labels for both music and speech. The dataset was manually annotated and verified via an iterative cross-checking… 
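The key property of the dataset is label co-occurrence: speech and music activity are annotated independently at the frame level, so a frame may carry both labels. A minimal sketch of how such strong labels can be represented and their overlap measured (the arrays, frame hop, and values below are hypothetical, not taken from the dataset):

```python
import numpy as np

# Hypothetical frame-level activity labels (e.g. one frame per 10 ms hop):
# True = active, False = inactive. Because speech and music are annotated
# as independent tracks, a frame may be active in both at once.
speech = np.array([1, 1, 1, 0, 0, 1, 1, 1, 0, 0], dtype=bool)
music  = np.array([0, 0, 1, 1, 1, 1, 1, 0, 0, 0], dtype=bool)

overlap = speech & music          # frames where the two labels co-occur
co_occurrence = overlap.mean()    # fraction of frames with both labels

print(f"co-occurring frames: {overlap.sum()} ({co_occurrence:.0%})")
```

In this toy example, 3 of 10 frames carry both labels, so the co-occurrence rate is 30%; on real broadcast-style audio, this statistic characterizes how polyphonic the material is.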
1 Citation


Music Classification: Beyond Supervised Learning, Towards Real-world Applications
This book focuses on the modern history of music classification since the popularization of deep learning in the mid-2010s, in contrast to the earlier era, which centered mainly on the design of audio features and the adoption of classifiers, alongside the emergence of many music classification problems.


AVA-Speech: A Densely Labeled Dataset of Speech Activity in Movies
A new, publicly released dataset is described containing densely labelled speech activity in YouTube videos, with the goal of creating a shared, available dataset for speech activity detection.
Temporal Convolutional Networks for Speech and Music Detection in Radio Broadcast
The study shows that Temporal Convolutional Network (TCN) architectures can outperform state-of-the-art architectures, and that the novel non-causal TCN extension introduced in the paper leads to a significant improvement in accuracy.
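The non-causal extension mentioned above replaces the left-only (causal) padding of a standard TCN with symmetric padding, so each output frame can attend to future as well as past context, which is acceptable for offline segmentation. A minimal NumPy sketch of one non-causal dilated 1-D convolution, the core operation of such a layer (the function name and toy input are illustrative, not from the paper):

```python
import numpy as np

def dilated_conv1d_noncausal(x, w, dilation):
    """Non-causal dilated 1-D convolution with zero 'same' padding.

    A causal TCN pads (k - 1) * dilation zeros on the left only, so
    output frame t depends on frames <= t. Padding symmetrically
    instead lets frame t also see future frames, suiting offline
    speech/music segmentation.
    """
    k = len(w)
    pad = (k - 1) * dilation // 2        # symmetric padding on each side
    xp = np.pad(x, pad)
    return np.array([
        sum(w[i] * xp[t + i * dilation] for i in range(k))
        for t in range(len(x))           # output length equals input length
    ])

x = np.arange(8, dtype=float)
y = dilated_conv1d_noncausal(x, np.array([1.0, 1.0, 1.0]), dilation=2)
# y[3] sums x[1], x[3], x[5]: context both before and after frame 3.
```

Stacking such layers with exponentially increasing dilations gives the large receptive field that makes TCNs competitive for frame-level detection.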
Music detection from broadcast contents using convolutional neural networks with a Mel-scale kernel
The proposed method consistently showed better performance in all three languages than the baseline system, and the F-score ranged from 86.5% for British data to 95.9% for Korean drama data.
Exploring convolutional, recurrent, and hybrid deep neural networks for speech and music detection in a large audio dataset
This work aims to study the implementation of several neural network-based systems for speech and music event detection over a collection of 77,937 10-second audio segments, selected from the Google AudioSet dataset.
A convolutional neural network (CNN) based architecture is proposed for the MIREX 2018 music and speech detection challenge; it is part of the inaSpeechSegmenter open-source framework, which was designed for conducting gender equality studies.
Construction and evaluation of a robust multifeature speech/music discriminator
E. D. Scheirer, M. Slaney · 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing, 1997
A real-time computer system capable of distinguishing speech signals from music signals over a wide range of digital audio input is constructed, and extensive data on system performance and the cross-validated training/test setup used to evaluate the system are provided.
Investigating the Effects of Training Set Synthesis for Audio Segmentation of Radio Broadcast
The proposed synthesis technique outperforms real-world data in some cases and serves as a promising alternative; the study also shows that the minimum level of audio ducking preferred by the machine learning algorithm was similar to that of human listeners.
MUSAN: A Music, Speech, and Noise Corpus
This report introduces a new corpus of music, speech, and noise suitable for training models for voice activity detection (VAD) and music/speech discrimination and demonstrates use of this corpus on Broadcast news and VAD for speaker identification.
Open Broadcast Media Audio from TV: A Dataset of TV Broadcast Audio with Relative Music Loudness Annotations
Open Broadcast Media Audio from TV (OpenBMAT) is an open, annotated dataset for the task of music detection, containing over 27 hours of TV broadcast audio from 4 countries distributed over 1,647 one-minute-long excerpts; it is the first to include annotations of the loudness of music relative to other simultaneous non-music sounds.