• Corpus ID: 89616388

The Aalto system based on fine-tuned AudioSet features for DCASE2018 task2 - general purpose audio tagging

@inproceedings{xu2018aalto,
  title={The Aalto system based on fine-tuned AudioSet features for DCASE2018 task2 - general purpose audio tagging},
  author={Zhicun Xu and Peter Smit and Mikko Kurimo},
  year={2018}
}
In this paper, we present a neural network system for DCASE 2018 Task 2, general-purpose audio tagging. We fine-tuned the Google AudioSet feature-generation model with different settings for the given 41 classes, on top of a fully connected layer with 100 units. We then used the fine-tuned models to generate a 128-dimensional feature vector for each 0.960 s audio segment. We tried different neural network structures, including LSTM and multi-level attention models. In our experiments, the multi-level attention…
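The pipeline in the abstract can be sketched in a few lines: each 0.960 s frame of a clip is mapped to a 128-dimensional embedding by the fine-tuned AudioSet (VGGish-style) extractor, and an attention layer pools the per-frame predictions into one clip-level output over the 41 classes. The following is a minimal numpy sketch under those assumptions; the random embeddings and weight matrices stand in for the real extractor and trained classifier, and all names are illustrative, not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)
N_CLASSES, EMB_DIM = 41, 128  # 41 tagging classes, 128-dim AudioSet embeddings

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_pool(frames, w_att, w_cls):
    """frames: (T, 128) frame embeddings -> (41,) clip-level class probabilities."""
    att = softmax(frames @ w_att, axis=0)   # (T, 41): attention weight of each frame, per class
    logits = frames @ w_cls                 # (T, 41): per-frame class logits
    pooled = (att * logits).sum(axis=0)     # attention-weighted sum over the T frames
    return softmax(pooled)

# Stand-in for the embeddings of a 10-frame (~9.6 s) clip; in the real
# system these come from the fine-tuned feature extractor.
frames = rng.standard_normal((10, EMB_DIM))
w_att = 0.01 * rng.standard_normal((EMB_DIM, N_CLASSES))
w_cls = 0.01 * rng.standard_normal((EMB_DIM, N_CLASSES))

probs = attention_pool(frames, w_att, w_cls)  # shape (41,), sums to 1
```

A multi-level variant, as in the cited attention work, would apply such a pooling module at several intermediate layers and combine the resulting predictions.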


Feature-informed Embedding Space Regularization For Audio Classification
The proposed regularization methods not only outperform baseline methods but also improve state-of-the-art models on several audio classification tasks, and the results suggest that using a mixture of features performs better than using individual features.


Multi-level Attention Model for Weakly Supervised Audio Classification
A multi-level attention model consisting of multiple attention modules applied to intermediate neural network layers, which achieves a state-of-the-art mean average precision (mAP) of 0.360, outperforming the single attention model and the Google baseline system.
Multi-Level and Multi-Scale Feature Aggregation Using Pretrained Convolutional Neural Networks for Music Auto-Tagging
The experiments show that using the combination of multi-level and multi-scale features is highly effective in music auto-tagging and the proposed method outperforms the previous state-of-the-art methods on the MagnaTagATune dataset and the Million Song Dataset.
DCASE2017 Challenge Setup: Tasks, Datasets and Baseline System
This paper presents the setup of these tasks: task definition, dataset, experimental setup, and baseline system results on the development dataset.
General-purpose Tagging of Freesound Audio with AudioSet Labels: Task Description, Dataset, and Baseline
The goal of the task is to build an audio tagging system that can recognize the category of an audio clip from a subset of 41 diverse categories drawn from the AudioSet Ontology.
Audio Set Classification with Attention Model: A Probabilistic Perspective
This paper investigates the Audio Set classification. Audio Set is a large scale weakly labelled dataset (WLD) of audio clips. In WLD only the presence of a label is known, without knowing the
CNN architectures for large-scale audio classification
This work uses various CNN architectures to classify the soundtracks of a dataset of 70M training videos with 30,871 video-level labels, investigates varying the size of both the training set and the label vocabulary, and finds that analogs of the CNNs used in image classification do well on this audio classification task and that larger training and label sets help up to a point.
Audio Set: An ontology and human-labeled dataset for audio events
The creation of Audio Set is described, a large-scale dataset of manually-annotated audio events that endeavors to bridge the gap in data availability between image and audio research and substantially stimulate the development of high-performance audio event recognizers.
A Multi-level Weighted Representation for Person Re-identification
A multi-level weighted representation for person re-identification is proposed, in which features containing strong discriminative powers or rich semantic meanings are extracted from different layers of a deep CNN, and an estimation subnet evaluates the quality of each feature and generates quality scores used as concatenation weights for all multi-level features.
Segmental Modeling for Audio Segmentation
  • Hagai Aronowitz
  • Computer Science
    2007 IEEE International Conference on Acoustics, Speech and Signal Processing - ICASSP '07
  • 2007
A novel segment-based generative model in which audio segments are modeled as supervectors and each class (speech, silence, music) is modeled by a distribution over the supervector space.
Very Deep Convolutional Networks for Large-Scale Image Recognition
This work investigates the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting using an architecture with very small convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers.