Corpus ID: 26806839

Audio Deepdream: Optimizing raw audio with convolutional networks

@inproceedings{Roberts2016AudioDO,
  title={Audio Deepdream: Optimizing raw audio with convolutional networks},
  author={Adam Roberts and Cinjon Resnick and Diego Ardila and Douglas Eck},
  year={2016}
}
The hallucinatory images of DeepDream [8] opened up the floodgates for a recent wave of artwork generated by neural networks. Consequently, we have followed in the footsteps of van den Oord et al. [13] and trained a network to predict embeddings that were themselves the result of a collaborative filtering model. A key difference is that we learn features directly from the raw audio, which creates a chain of differentiable functions from raw audio to high-level features.
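Because the whole chain from raw audio to high-level features is differentiable, the waveform itself can be optimized by gradient ascent on a feature activation, exactly as DeepDream does for images. A minimal sketch of that idea follows; the small untrained 1-D conv net here is a hypothetical stand-in for the paper's trained embedding-prediction network, and the layer sizes and step counts are illustrative assumptions, not the authors' configuration.

```python
# DeepDream-style gradient ascent on raw audio (sketch; untrained stand-in net).
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in for a conv net trained to predict collaborative-filtering embeddings.
model = nn.Sequential(
    nn.Conv1d(1, 8, kernel_size=9, stride=4),   # raw waveform in, features out
    nn.ReLU(),
    nn.Conv1d(8, 16, kernel_size=9, stride=4),
    nn.ReLU(),
)
for p in model.parameters():
    p.requires_grad_(False)  # the network is frozen; only the audio is updated

# Start from noise "audio": 1 batch, 1 channel, 4096 samples.
audio = torch.randn(1, 1, 4096, requires_grad=True)
opt = torch.optim.Adam([audio], lr=0.05)

def activation_strength(x):
    # Objective: overall activation of the last layer, as in DeepDream.
    return model(x).pow(2).mean()

before = activation_strength(audio).item()
for _ in range(50):
    opt.zero_grad()
    loss = -activation_strength(audio)  # gradient ascent via negated loss
    loss.backward()
    opt.step()
after = activation_strength(audio).item()
```

With a trained network in place of the random one, the same loop "dreams" audio that excites the chosen features rather than noise that excites random filters.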

Citations

SampleCNN: End-to-End Deep Convolutional Neural Networks Using Very Small Filters for Music Classification
TLDR
A CNN architecture which learns representations using sample-level filters beyond typical frame-level input representations is proposed and extended using multi-level and multi-scale feature aggregation technique and subsequently conduct transfer learning for several music classification tasks.
Sample-level Deep Convolutional Neural Networks for Music Auto-tagging Using Raw Waveforms
TLDR
The experiments show how deep architectures with sample-level filters improve the accuracy in music auto-tagging and they provide results comparable to previous state-of-the-art performances for the Magnatagatune dataset and Million Song Dataset.
Learning Hierarchy Aware Embedding From Raw Audio for Acoustic Scene Classification
  • V. Abrol, Pulkit Sharma
  • Computer Science
    IEEE/ACM Transactions on Audio, Speech, and Language Processing
  • 2020
TLDR
This work proposes a raw waveform based end-to-end ASC system using convolutional neural network that leverages the hierarchical relations between acoustic categories to improve the classification performance and uses a prototypical model.
On Using Backpropagation for Speech Texture Generation and Voice Conversion
TLDR
A proof-of-concept system for speech texture synthesis and voice conversion based on two mechanisms: approximate inversion of the representation learned by a speech recognition neural network, and matching statistics of neuron activations between different source and target utterances.
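The statistics-matching mechanism mentioned above can be sketched in the same differentiable-audio setting: drive a generated waveform so that the per-channel mean activations of a frozen network match those of a target utterance. The random network and random "target utterance" below are stand-ins, not the trained recognizer that paper uses.

```python
# Sketch of activation-statistics matching for audio (frozen stand-in network).
import torch
import torch.nn as nn

torch.manual_seed(0)
net = nn.Sequential(nn.Conv1d(1, 8, kernel_size=9, stride=4), nn.Tanh())
for p in net.parameters():
    p.requires_grad_(False)

target = torch.randn(1, 1, 2048)          # stand-in for the target utterance
target_stats = net(target).mean(dim=-1)   # per-channel mean activations

gen = torch.randn(1, 1, 2048, requires_grad=True)
opt = torch.optim.Adam([gen], lr=0.05)

def stats_loss(x):
    # Squared distance between generated and target activation statistics.
    return (net(x).mean(dim=-1) - target_stats).pow(2).sum()

start = stats_loss(gen).item()
for _ in range(100):
    opt.zero_grad()
    loss = stats_loss(gen)
    loss.backward()
    opt.step()
end = stats_loss(gen).item()
```

Richer statistics (e.g. Gram-style correlations across channels) plug into the same loop by changing only `stats_loss`.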
A Feature Learning Siamese Model for Intelligent Control of the Dynamic Range Compressor
TLDR
A siamese DNN model is proposed to learn the characteristics of the audio dynamic range compressor and shows better performance than handcrafted audio features when predicting DRC parameters for both mono-instrument audio loops and polyphonic music pieces.
Toward Audio Beehive Monitoring: Deep Learning vs. Standard Machine Learning in Classifying Beehive Audio Samples
TLDR
This investigation designed several convolutional neural networks and compared their performance with four standard machine learning methods in classifying audio samples from microphones deployed above landing pads of Langstroth beehives, indicating that convolutional neural networks can be added to a repertoire of in situ audio classification algorithms for electronic beehive monitoring.
Earthquake Event Classification Using Multitasking Deep Learning
TLDR
An attention-based convolutional neural network architecture for multitasking learning to accurately classify not only the presence of an earthquake but also the event type of the earthquake is proposed.
Generating Audio Using Recurrent Neural Networks (5-30-2018)
On Video Analysis of Omnidirectional Bee Traffic: Counting Bee Motions with Motion Detection and Image Classification
TLDR
This investigation proposed, implemented, and partially evaluated a two-tier method for counting bee motions to estimate levels of omnidirectional bee traffic in bee traffic videos, which couples motion detection with image classification so that motion detection acts as a class-agnostic object location method that generates a set of regions with possible objects.
Sentiment analysis of Japanese text and vocabulary learning based on natural language processing and SVM
TLDR
This study combines the TF-IDF algorithm with SVM to construct a Japanese text sentiment classification model and proposes a chi-square statistic that combines word frequency factor, inter-class concentration coefficient, and correction coefficient.

References

SHOWING 1-10 OF 14 REFERENCES
Going deeper with convolutions
We propose a deep convolutional neural network architecture codenamed Inception that achieves the new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14).
End-to-end learning for music audio
  • S. Dieleman, B. Schrauwen
  • Computer Science
    2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2014
TLDR
Although convolutional neural networks do not outperform a spectrogram-based approach, the networks are able to autonomously discover frequency decompositions from raw audio, as well as phase- and translation-invariant feature representations.
Speech acoustic modeling from raw multichannel waveforms
TLDR
A convolutional neural network - deep neural network (CNN-DNN) acoustic model which takes raw multichannel waveforms as input, and learns a similar feature representation through supervised training and outperforms a DNN that uses log-mel filterbank magnitude features under noisy and reverberant conditions.
Very Deep Convolutional Networks for Large-Scale Image Recognition
TLDR
This work investigates the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting using an architecture with very small convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers.
Texture Networks: Feed-forward Synthesis of Textures and Stylized Images
TLDR
This work proposes an alternative approach that moves the computational burden to a learning stage and trains compact feed-forward convolutional networks to generate multiple samples of the same texture of arbitrary size and to transfer artistic style from a given image to any other image.
ImageNet classification with deep convolutional neural networks
TLDR
A large, deep convolutional neural network was trained to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes and employed a recently developed regularization method called "dropout" that proved to be very effective.
Perceptual Losses for Real-Time Style Transfer and Super-Resolution
TLDR
This work considers image transformation problems, and proposes the use of perceptual loss functions for training feed-forward networks for image transformation tasks, and shows results on image style transfer, where a feed-forward network is trained to solve the optimization problem proposed by Gatys et al. in real time.
A Neural Algorithm of Artistic Style
TLDR
This work introduces an artificial system based on a Deep Neural Network that creates artistic images of high perceptual quality and offers a path forward to an algorithmic understanding of how humans create and perceive artistic imagery.
Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
TLDR
Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin.
Deep content-based music recommendation
TLDR
This paper proposes to use a latent factor model for recommendation, and predict the latent factors from music audio when they cannot be obtained from usage data, and shows that recent advances in deep learning translate very well to the music recommendation setting, with deep convolutional neural networks significantly outperforming the traditional approach.