• Corpus ID: 208158326

Cross-modal supervised learning for better acoustic representations

Shaoyong Jia, Xin Shu, Yan Zhi Yang, Dawei Liang, Qiyue Liu, Junhui Liu
Obtaining large-scale human-labeled datasets to train acoustic representation models is very challenging. In contrast, data with machine-generated labels is easy to collect. In this work, we propose to exploit machine-generated labels to learn better acoustic representations, based on the synchronization between vision and audio. First, we collect a large-scale video dataset of 15 million samples, totaling 16,320 hours. Each video is 3 to 5 seconds in length and… 
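The core idea the abstract describes, using a visual model's predictions on synchronized video frames as machine-generated labels for the accompanying audio, can be sketched in a few lines. The sketch below is a toy illustration on synthetic data, not the paper's pipeline: `visual_teacher` is a hypothetical stand-in for a pretrained image classifier, and the linear softmax "audio model" stands in for a real acoustic network.

```python
import numpy as np

rng = np.random.default_rng(0)

def visual_teacher(frame_features):
    # Hypothetical stand-in for a pretrained image classifier:
    # it assigns each video frame a class label.
    return frame_features.argmax(axis=1)

# Synthetic synchronized data: the audio features are correlated with
# the visual features because vision and audio come from the same clip.
frames = rng.normal(size=(1000, 5))                    # 1000 clips, 5 classes
audio = frames + 0.1 * rng.normal(size=frames.shape)   # correlated audio features
labels = visual_teacher(frames)                        # machine-generated labels

# Train a linear softmax "audio model" on the machine-generated labels.
W = np.zeros((5, 5))
onehot = np.eye(5)[labels]
for _ in range(200):
    logits = audio @ W.T
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    W += 0.1 * (onehot - p).T @ audio / len(audio)     # gradient step

acc = (np.argmax(audio @ W.T, axis=1) == labels).mean()
```

Because the audio features track the visual features, the audio model learns to reproduce the teacher's labels without any human annotation, which is the premise of the paper's supervision scheme.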


CNN architectures for large-scale audio classification
  • Shawn Hershey, S. Chaudhuri, +10 authors K. Wilson
  • Computer Science, Mathematics
    2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2017
This work uses various CNN architectures to classify the soundtracks of a dataset of 70M training videos with 30,871 video-level labels, and investigates varying the size of both the training set and the label vocabulary. It finds that analogs of the CNNs used in image classification do well on this audio classification task, and that larger training and label sets help up to a point.
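CNN audio classifiers of this kind typically operate on log-mel spectrograms rather than raw waveforms. As a minimal sketch of the front end (not the paper's exact configuration; the filter count and FFT size here are illustrative defaults), a triangular mel filterbank can be built as follows:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels=64, n_fft=512, sr=16000):
    # Triangular filters with centers spaced evenly on the mel scale,
    # applied to a magnitude spectrum of n_fft // 2 + 1 bins.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):                    # rising edge
            fb[i - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):                    # falling edge
            fb[i - 1, k] = (r - k) / max(r - c, 1)
    return fb
```

Multiplying a power spectrogram by this matrix and taking the log yields the log-mel features that image-style CNNs then treat as a 2-D input.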
Large-Scale Video Classification with Convolutional Neural Networks
This work studies multiple approaches for extending the connectivity of a CNN in time domain to take advantage of local spatio-temporal information and suggests a multiresolution, foveated architecture as a promising way of speeding up the training.
Tencent ML-Images: A Large-Scale Multi-Label Image Database for Visual Representation Learning
A large-scale multi-label image database with 18M images and 11K categories is built, dubbed Tencent ML-Images, to enhance the quality of visual representation of the trained CNN model and to promote other vision tasks in the research and industry community.
SoundNet: Learning Sound Representations from Unlabeled Video
This work proposes a student-teacher training procedure which transfers discriminative visual knowledge from well established visual recognition models into the sound modality using unlabeled video as a bridge, and suggests some high-level semantics automatically emerge in the sound network, even though it is trained without ground truth labels.
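The student-teacher transfer in SoundNet minimizes the divergence between the visual network's class posterior on a video frame and the sound network's posterior on the synchronized audio. A minimal sketch of that loss (the KL divergence from teacher to student over logits; the function names here are illustrative, not SoundNet's API):

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_kl(teacher_logits, student_logits):
    # KL(teacher || student), averaged over the batch; the student is
    # trained to drive this toward zero, with no ground-truth labels.
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return float(np.mean(np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12)),
                                axis=-1)))
```

The loss is zero exactly when the sound network reproduces the visual network's distribution, which is how discriminative visual knowledge is transferred into the audio modality.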
A multi-device dataset for urban acoustic scene classification
The acoustic scene classification task of the DCASE 2018 Challenge and the TUT Urban Acoustic Scenes 2018 dataset provided for the task are introduced, and the performance of a baseline system on the task is evaluated.
Unsupervised Filterbank Learning Using Convolutional Restricted Boltzmann Machine for Environmental Sound Classification
Using a CNN classifier, the ConvRBM filterbank and its score-level fusion with the Mel filterbank energies (FBEs) gave absolute improvements of 10.65% and 18.70% in classification accuracy, respectively, over FBEs alone on the ESC-50 database. This shows that the ConvRBM filterbank contains highly complementary information over the Mel filterbank, which is helpful in the ESC task.
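Score-level (late) fusion of this kind simply combines the per-class posterior scores of the two classifiers before taking the argmax. A minimal sketch with made-up numbers, showing how complementary classifiers can correct each other's errors:

```python
import numpy as np

def fuse(scores_a, scores_b, w=0.5):
    # Late fusion: weighted average of per-class posterior scores.
    return w * scores_a + (1.0 - w) * scores_b

# Toy posteriors for two examples whose true class is 0: the "mel"
# classifier errs on example 2, the "convrbm" classifier on example 1.
mel = np.array([[0.70, 0.30],
                [0.40, 0.60]])
convrbm = np.array([[0.45, 0.55],
                    [0.80, 0.20]])

fused = fuse(mel, convrbm)
# fused.argmax(axis=1) picks class 0 for both examples.
```

Fusion helps precisely when the two filterbanks make different mistakes, which is the complementarity the entry above reports.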
Audio Set: An ontology and human-labeled dataset for audio events
The creation of Audio Set is described, a large-scale dataset of manually-annotated audio events that endeavors to bridge the gap in data availability between image and audio research and substantially stimulate the development of high-performance audio event recognizers.
Acoustic classification of multiple simultaneous bird species: a multi-instance multi-label approach.
This work formulates the problem of classifying the set of species present in an audio recording using the multi-instance multi-label (MIML) framework for machine learning, and proposes a MIML bag generator for audio: an algorithm that transforms an input audio signal into a bag-of-instances representation suitable for use with MIML classifiers.
Moments in Time Dataset: One Million Videos for Event Understanding
The Moments in Time dataset, a large-scale human-annotated collection of one million short videos corresponding to dynamic events unfolding within three seconds, can serve as a new challenge to develop models that scale to the level of complexity and abstract reasoning that a human processes on a daily basis.
A blind segmentation approach to acoustic event detection based on i-vector
A new blind segmentation approach to acoustic event detection (AED) based on i-vectors, inspired by block-based automatic image annotation in image retrieval tasks, which shows promising results with an average 8% absolute gain in F1 over the conventional hidden-Markov-model-based approach.