Surgical Mask Detection with Convolutional Neural Networks and Data Augmentations on Spectrograms

Steffen Illium, Robert Müller, Andreas Sedlmeier, Claudia Linnhoff-Popien
In many fields of research, labeled datasets are hard to acquire. This is where data augmentation promises to overcome the lack of training data in the context of neural network engineering and classification tasks. The idea is to reduce model over-fitting to the feature distribution of a small, under-descriptive training dataset. We evaluate such data augmentation techniques to gain insights into the performance boost they provide for several convolutional neural networks on mel…
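The kind of masking-style augmentation applied to mel-spectrograms can be sketched as follows; the function name, parameter names, and widths here are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def mask_spectrogram(spec, max_freq_width=8, max_time_width=16, rng=None):
    """Zero out one random frequency band and one random time span.

    `spec` is a (n_mels, n_frames) mel-spectrogram. Widths are
    illustrative defaults, not values from the paper.
    """
    rng = np.random.default_rng() if rng is None else rng
    out = spec.copy()
    n_mels, n_frames = out.shape

    f = rng.integers(0, max_freq_width + 1)   # band height (may be 0)
    f0 = rng.integers(0, n_mels - f + 1)      # band start
    out[f0:f0 + f, :] = 0.0

    t = rng.integers(0, max_time_width + 1)   # span length (may be 0)
    t0 = rng.integers(0, n_frames - t + 1)    # span start
    out[:, t0:t0 + t] = 0.0
    return out
```

During training, each spectrogram would be masked freshly on every epoch, so the network never sees exactly the same input twice.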


Capturing Time Dynamics From Speech Using Neural Networks for Surgical Mask Detection

This paper presents two effective neural network models to detect surgical masks from audio, based on Convolutional Neural Networks (CNNs), chosen as an optimal approach for the spatial processing of the audio signals.

Visual Transformers for Primates Classification and Covid Detection

The vision transformer, a deep machine learning model built around the attention mechanism, is applied to mel-spectrogram representations of raw audio recordings and achieves comparable performance on both tasks of ComParE21 (PRS and CCS), outperforming most single-model baselines.

VoronoiPatches: Evaluating A New Data Augmentation Method

A new data augmentation algorithm, VoronoiPatches (VP), primarily utilizes non-linear re-combination of information within an image, fragmenting and occluding small information patches, and improves CNN model robustness on unseen data.
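A heavily simplified sketch of that patch-recombination idea: partition an image into random Voronoi cells and replace one cell with pixels from another image. This is not the authors' exact algorithm, only an illustration of the mechanism:

```python
import numpy as np

def voronoi_patch_mix(img_a, img_b, n_seeds=12, rng=None):
    """Replace one random Voronoi cell of img_a with pixels from img_b.

    Simplified illustration of non-linear re-combination of image
    regions; not the published VoronoiPatches algorithm.
    Images are (H, W, C) arrays of equal shape.
    """
    rng = np.random.default_rng() if rng is None else rng
    h, w = img_a.shape[:2]
    seeds = rng.integers(0, [h, w], size=(n_seeds, 2))  # random sites
    ys, xs = np.mgrid[0:h, 0:w]
    # squared distance from every pixel to every seed -> nearest-site map
    d2 = (ys[..., None] - seeds[:, 0]) ** 2 + (xs[..., None] - seeds[:, 1]) ** 2
    cell = d2.argmin(axis=-1)                           # (H, W) cell labels
    chosen = rng.integers(0, n_seeds)                   # cell to occlude
    out = img_a.copy()
    out[cell == chosen] = img_b[cell == chosen]
    return out
```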

Using Self-Supervised Feature Extractors with Attention for Automatic COVID-19 Detection from Speech

Experimental results demonstrate that models trained on features extracted from self-supervised models perform similarly to or outperform fully-supervised models and models based on handcrafted features.

A Deep and Recurrent Architecture for Primate Vocalization Classification

This work presents a deep and recurrent architecture for the classification of primate vocalizations that is based upon well-proven modules such as bidirectional Long Short-Term Memory networks, pooling, normalized softmax, and focal loss.

Analytical Review of Audiovisual Systems for Determining Personal Protective Equipment on a Person's Face

An analytical review of existing and developing intelligent information technologies for bimodal analysis of the voice and facial characteristics of a masked person is presented.

End-to-end Ensemble-based Feature Selection for Paralinguistics Tasks

This work proposes an ensemble-based automatic feature selection method to enable the development of fast and memory-efficient systems and proposes an output-gradient-based method to discover essential features using large, well-performing ensembles before training a smaller one.

Identifying surgical-mask speech using deep neural networks on low-level aggregation

This work proposes an MSI approach using deep networks on Low-Level Aggregation (LLA) of speech chunks, which lets deep models adapt better by feeding many more samples into training without employing pre-trained knowledge.



CNN architectures for large-scale audio classification

This work uses various CNN architectures to classify the soundtracks of a dataset of 70M training videos with 30,871 video-level labels, and investigates varying the size of both training set and label vocabulary, finding that analogs of the CNNs used in image classification do well on the authors' audio classification task, and larger training and label sets help up to a point.

Very deep convolutional neural networks for raw waveforms

This work proposes very deep convolutional neural networks that directly use time-domain waveforms as inputs and are efficient to optimize over the very long sequences necessary for processing acoustic waveforms.

SubSpectralNet – Using Sub-spectrogram Based Convolutional Neural Networks for Acoustic Scene Classification

SubSpectralNet is a novel model that captures discriminative features by incorporating frequency band-level differences, using mel-spectrograms and a sub-spectrogram based CNN architecture to model soundscapes for acoustic scene classification (ASC).

Environmental sound classification with convolutional neural networks

  • Karol J. Piczak
  • Computer Science
    2015 IEEE 25th International Workshop on Machine Learning for Signal Processing (MLSP)
  • 2015
The model outperforms baseline implementations relying on mel-frequency cepstral coefficients and achieves results comparable to other state-of-the-art approaches.

A Deep Residual Network for Large-Scale Acoustic Scene Analysis

The task of training a multi-label event classifier directly from the audio recordings of AudioSet is studied and it is found that the models are able to localize audio events when a finer time resolution is needed.

Audio augmentation for speech recognition

This paper investigates audio-level speech augmentation methods which directly process the raw signal, and presents results on 4 different LVCSR tasks with training data ranging from 100 hours to 1000 hours, to examine the effectiveness of audio augmentation in a variety of data scenarios.
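Audio-level augmentation on the raw signal can be illustrated by speed perturbation via simple linear-interpolation resampling; the function and the factor values (e.g. 0.9, 1.0, 1.1) are illustrative assumptions, and production systems typically use a proper resampler:

```python
import numpy as np

def speed_perturb(wave, factor):
    """Resample a raw 1-D waveform to play `factor` times faster.

    Linear interpolation as a stand-in for a real resampler; a
    factor of 1.1 shortens the signal, 0.9 lengthens it.
    """
    n_out = int(round(len(wave) / factor))
    old_t = np.arange(len(wave))
    new_t = np.linspace(0, len(wave) - 1, n_out)
    return np.interp(new_t, old_t, wave)
```

Training on copies perturbed with several factors effectively multiplies the amount of raw training audio.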

Exploring Data Augmentation to Improve Music Genre Classification with ConvNets

This work addresses automatic music genre classification as a pattern recognition task, using spectrograms created from the audio signal and describing the patterns via representation learning with a convolutional neural network (CNN).

Specaugment on Large Scale Datasets

  • Daniel S. Park, Yu Zhang, Yonghui Wu
  • Computer Science
    ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2020
This paper demonstrates its effectiveness on tasks with large scale datasets by investigating its application to the Google Multidomain Dataset and introduces a modification of SpecAugment that adapts the time mask size and/or multiplicity depending on the length of the utterance, which can potentially benefit large scale tasks.
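The length-adaptive time masking described above can be sketched as follows; the ratios and the one-mask-per-100-frames rule are assumptions for illustration, not the paper's settings:

```python
import numpy as np

def adaptive_time_mask(spec, max_ratio=0.05, n_masks_per_100=1, rng=None):
    """Apply time masks whose width and count scale with utterance length.

    `spec` is a (n_mels, n_frames) spectrogram: the maximum mask width
    is a fixed fraction of the frame count, and roughly one mask is
    applied per 100 frames. Both ratios are illustrative assumptions.
    """
    rng = np.random.default_rng() if rng is None else rng
    out = spec.copy()
    n_frames = out.shape[1]
    max_width = max(1, int(max_ratio * n_frames))
    n_masks = max(1, (n_frames * n_masks_per_100) // 100)
    for _ in range(n_masks):
        t = rng.integers(1, max_width + 1)   # mask width
        t0 = rng.integers(0, n_frames - t + 1)
        out[:, t0:t0 + t] = 0.0
    return out
```

Fixed-size masks tuned on short utterances can erase too little of a long one; scaling the mask budget with length keeps the relative distortion roughly constant.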

Randomly Weighted CNNs for (Music) Audio Classification

  • Jordi Pons, X. Serra
  • Computer Science
    ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2019
This work uses features extracted from the embeddings of deep architectures as input to a classifier – with the goal to compare classification accuracies when using different randomly weighted architectures.
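A toy version of that idea, feature extraction with a single untrained convolutional layer whose random weights are never updated, might look like this; the filter count, kernel size, and pooling choice are assumptions:

```python
import numpy as np

def random_conv_features(spec, n_filters=32, kernel=5, rng=None):
    """Extract features with one random, untrained conv layer.

    Convolves a (n_mels, n_frames) spectrogram with fixed random
    filters, applies ReLU and global mean pooling; the resulting
    vector can feed any downstream classifier.
    """
    rng = np.random.default_rng() if rng is None else rng
    filters = rng.standard_normal((n_filters, kernel, kernel))
    # all valid kernel-sized windows of the input, shape (H', W', k, k)
    windows = np.lib.stride_tricks.sliding_window_view(spec, (kernel, kernel))
    feats = np.empty(n_filters)
    for i, f in enumerate(filters):
        resp = np.einsum('ijkl,kl->ij', windows, f)   # cross-correlation
        feats[i] = np.maximum(resp, 0.0).mean()       # ReLU + mean pool
    return feats
```

Comparing classifier accuracy on such random-embedding features across architectures is exactly the kind of probe the summary above describes.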


This report describes the CP-JKU team's 4 submissions for Task 1 (Acoustic Scene Classification) of the DCASE-2016 challenge, proposing a novel i-vector extraction scheme for ASC that uses both left and right audio channels, alongside a deep convolutional neural network architecture trained end-to-end on spectrograms of audio excerpts.