• Corpus ID: 237274283

Audio Recognition using Mel Spectrograms and Convolution Neural Networks

  title={Audio Recognition using Mel Spectrograms and Convolution Neural Networks},
  author={Boyang Zhang and Jared Leitner and Samuel Thornton},
Automatic sound recognition has received heightened research interest in recent years due to its many potential applications, including automatic labeling of video/audio content and real-time sound detection for robotics. While image classification is a heavily researched topic, sound identification is less mature. In this study, we take advantage of the robust machine learning techniques developed for image classification and apply them to the sound recognition problem. Raw audio data from…
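The pipeline the abstract describes, turning raw audio into mel spectrogram images that a CNN can consume, can be sketched with plain numpy. This is an illustrative sketch, not the authors' implementation; the sample rate, FFT size, hop length, and number of mel bands are assumed values, not taken from the paper.

```python
import numpy as np

def hz_to_mel(f):
    # Hz -> mel scale (O'Shaughnessy formula).
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    # Inverse of hz_to_mel.
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    # Triangular filters with centers evenly spaced on the mel scale.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):                      # rising slope
            fb[i - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):                      # falling slope
            fb[i - 1, k] = (r - k) / max(r - c, 1)
    return fb

def mel_spectrogram(signal, sr=16000, n_fft=512, hop=256, n_mels=40):
    # Frame the signal, apply a Hann window, take the power STFT,
    # project onto the mel filterbank, then log-compress.
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2
    mel = power @ mel_filterbank(n_mels, n_fft, sr).T
    return np.log(mel + 1e-10).T  # shape: (n_mels, n_frames)
```

The resulting `(n_mels, n_frames)` array is treated like a single-channel image, which is what lets image-classification CNN architectures be reused directly, e.g. `mel_spectrogram(np.sin(2 * np.pi * 440 * np.arange(16000) / 16000))` yields a 40-band log-mel image for one second of a 440 Hz tone.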


Environmental Sound Classification Based on Transfer-Learning Techniques with Multiple Optimizers

This paper aims to determine the effectiveness of employing pre-trained convolutional neural networks (CNNs) for audio categorization and the feasibility of retraining them, and investigates hyper-parameter choices such as learning rate and number of epochs, along with the Adam, Adamax, and RMSprop optimizers, across several pre-trained models.

Investigating Multi-Feature Selection and Ensembling for Audio Classification

An extensive evaluation of several cutting-edge DL models with various state-of-the-art audio features, with a focus on feature selection, suggests that the best feature selection depends on both the dataset and the model.


This work uses pre-trained models trained using AudioSet data, a large-scale dataset of manually annotated audio events, to create captions that can explain the given audio sound data using machine learning techniques.

Implementation of Constant-Q Transform (CQT) and Mel Spectrogram to converting Bird’s Sound

Bird vocalizations are converted into mel spectrogram images, and the resulting images can be classified with a CNN to help identify bird sounds.

Deep Learning for Enhanced Scratch Input

The results indicate high potential for the application of deep learning techniques to natural user interface systems that can readily convert large unpowered surfaces into a user interface using just a smartphone with no special-purpose sensors or hardware.

Improved remote mental health illness assessment and detection using facial emotion detection and speech emotion detection

This study combines established therapy techniques for mental health assessment with machine learning models for facial emotion recognition and speech pattern recognition, aiming to better understand a patient's mental health condition and support treatment.

Diagnosing the Stage of COVID-19 using Machine Learning on Breath Sounds

A machine-learning based approach is proposed to monitor lung condition by analyzing the breath sounds of a patient for respiratory sounds like wheezes, crackles and tachypnea, which in turn can identify the stage of COVID-19.

FLOR: A Federated Learning-based Music Recommendation Engine

A federated learning-based approach that scalably tunes music clusters to accurately describe users' preferences in a particular genre and can improve song recommendations based on the user's musical tastes is proposed.



CNN architectures for large-scale audio classification

This work uses various CNN architectures to classify the soundtracks of a dataset of 70M training videos with 30,871 video-level labels, and investigates varying the size of both training set and label vocabulary, finding that analogs of the CNNs used in image classification do well on the authors' audio classification task, and larger training and label sets help up to a point.

Deep Convolutional Neural Networks and Data Augmentation for Environmental Sound Classification

It is shown that the improved performance stems from the combination of a deep, high-capacity model and an augmented training set: this combination outperforms both the proposed CNN without augmentation and a “shallow” dictionary learning model with augmentation.

Environmental sound classification with convolutional neural networks

  • Karol J. Piczak
  • Computer Science
    2015 IEEE 25th International Workshop on Machine Learning for Signal Processing (MLSP)
  • 2015
The model outperforms baseline implementations relying on mel-frequency cepstral coefficients and achieves results comparable to other state-of-the-art approaches.

Environmental Sound Recognition With Time–Frequency Audio Features

An empirical feature analysis for audio environment characterization is performed and a matching pursuit algorithm is proposed to use to obtain effective time-frequency features to yield higher recognition accuracy for environmental sounds.

A Benchmark Dataset for Audio Classification and Clustering

This work presents a freely available benchmark dataset for audio classification and clustering that consists of 10 seconds samples of 1886 songs obtained from the Garageband site, and presents some initial results using a set of audio features generated by a feature construction approach.

Freesound Datasets: A Platform for the Creation of Open Audio Datasets

Paper presented at the 18th International Society for Music Information Retrieval Conference, held in Suzhou, China, 23–27 October 2017.

2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

This is the 24th in a series of conferences presenting work in experimental and theoretical signal processing, speech, and acoustics.

Interpreting and Explaining Deep Neural Networks for Classification of Audio Signals

This paper presents a novel audio dataset of English spoken digits, which is used for classification tasks on spoken digits and speaker gender, and confirms that the networks are highly reliant on features marked as relevant by layer-wise relevance propagation (LRP).

Spectrogram, Cepstrum and Mel-Frequency Analysis

  • Carnegie Mellon University.

The Physics Hypertextbook