You Only Hear Once: A YOLO-like Algorithm for Audio Segmentation and Sound Event Detection

@article{venkatesh_yoho,
  title={You Only Hear Once: A YOLO-like Algorithm for Audio Segmentation and Sound Event Detection},
  author={S. Venkatesh and Dave Moffat and Eduardo Reck Miranda}
}
Audio segmentation and sound event detection are crucial topics in machine listening that aim to detect acoustic classes and their respective boundaries. They are useful for audio-content analysis, speech recognition, audio indexing, and music information retrieval. In recent years, most research articles have adopted segmentation-by-classification, a technique that divides audio into small frames and performs classification on each frame individually. In this paper, we present a novel approach called…
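To make the segmentation-by-classification baseline concrete, the sketch below frames a waveform, classifies each frame independently, and merges consecutive same-label frames into segments. The function name, the toy energy-based classifier, and all parameters are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def segmentation_by_classification(audio, sr, frame_sec=1.0, classify=None):
    """Illustrative sketch: split audio into fixed-length frames,
    classify each frame independently, then merge consecutive frames
    with the same label into (label, start, end) segments."""
    frame_len = int(frame_sec * sr)
    labels = []
    for start in range(0, len(audio), frame_len):
        frame = audio[start:start + frame_len]
        labels.append(classify(frame))
    # merge runs of identical labels into time segments
    segments = []
    for i, label in enumerate(labels):
        start_t = i * frame_sec
        end_t = min((i + 1) * frame_sec, len(audio) / sr)
        if segments and segments[-1][0] == label:
            segments[-1] = (label, segments[-1][1], end_t)
        else:
            segments.append((label, start_t, end_t))
    return segments

# toy classifier: "speech" if frame energy exceeds a threshold, else "music"
toy = lambda f: "speech" if np.mean(f**2) > 0.5 else "music"
sr = 4
audio = np.array([1.0] * 8 + [0.1] * 4)  # 2 s loud, 1 s quiet at 4 Hz
print(segmentation_by_classification(audio, sr, frame_sec=1.0, classify=toy))
# → [('speech', 0.0, 2.0), ('music', 2.0, 3.0)]
```

Because each frame is classified in isolation, this baseline requires a separate inference per frame; the YOHO approach instead predicts class boundaries directly, which is what enables the lower inference times reported below.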


Evaluating robustness of You Only Hear Once (YOHO) Algorithm on noisy audios in the VOICe Dataset

The You Only Hear Once (YOHO) algorithm can match the performance of various state-of-the-art algorithms on datasets such as the Music Speech Detection Dataset, TUT Sound Events, and Urban-SED, but at lower inference times.

Computational bioacoustics with deep learning: a review and roadmap

This paper offers a subjective but principled roadmap for computational bioacoustics with deep learning: topics that the community should address, in order to make the most of future developments in AI and informatics, and to use audio data in answering zoological and ecological questions.

A Study on the Use of wav2vec Representations for Multiclass Audio Segmentation

Experimental results show that wav2vec representations can improve the performance of audio segmentation systems for classes containing speech, while showing a degradation in the segmentation of isolated music.

Sound Classification and Processing of Urban Environments: A Systematic Literature Review

Deep learning architectures, attention mechanisms, data augmentation techniques, and pretraining emerge as the most crucial factors to consider when creating an efficient sound classification model.

I see what you hear: a vision-inspired method to localize words

A lightweight solution for word detection and localization that uses bounding box regression, enabling the model to detect the occurrence, offset, and duration of keywords in a given audio stream.
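Vision-style detectors of this kind regress a confidence plus a 1-D "box" per candidate event; decoding those outputs into keyword time spans can be sketched as below. The (confidence, center, width) parameterization and the function name are assumptions for illustration, not the cited paper's exact output format:

```python
def decode_boxes(preds, clip_sec, conf_thresh=0.5):
    """Hypothetical decoding of 1-D bounding-box outputs for audio:
    each prediction is (confidence, center, width), with center and
    width normalized to [0, 1] over the clip duration."""
    events = []
    for conf, center, width in preds:
        if conf >= conf_thresh:
            # convert normalized (center, width) to onset/offset seconds
            onset = max(0.0, (center - width / 2) * clip_sec)
            offset = min(clip_sec, (center + width / 2) * clip_sec)
            events.append((round(onset, 2), round(offset, 2)))
    return events

preds = [(0.9, 0.25, 0.1), (0.3, 0.6, 0.2), (0.8, 0.75, 0.3)]
print(decode_boxes(preds, clip_sec=10.0))
# → [(2.0, 3.0), (6.0, 9.0)]
```

The low-confidence middle prediction is suppressed by the threshold, mirroring how vision detectors discard weak boxes before any non-maximum suppression step.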

Extending Radio Broadcasting Semantics through Adaptive Audio Segmentation Automations

The present paper focuses on adaptive audio detection, segmentation, and classification techniques for audio broadcasting content, dedicated mainly to voice data. It contributes to the formulation of a dynamic Generic Audio Classification Repository to be subjected to adaptive multilabel experimentation with more sophisticated techniques, such as deep architectures.

Artificially Synthesising Data for Audio Classification and Segmentation to Improve Speech and Music Detection in Radio Broadcast

  • S. Venkatesh, D. Moffat, E. Miranda
  • Computer Science
    ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2021
The data synthesis procedure is demonstrated to be a highly effective technique for generating large datasets to train deep neural networks for audio segmentation, and it outperformed state-of-the-art algorithms for music-speech detection.

Sound Event Detection of Weakly Labelled Data With CNN-Transformer and Automatic Threshold Optimization

A convolutional neural network transformer (CNN-Transformer) is proposed for audio tagging and SED, and it is shown that the CNN-Transformer performs similarly to a convolutional recurrent neural network (CRNN).

Advances in unsupervised audio classification and segmentation for the broadcast news and NGSW corpora

A new algorithm is proposed for audio classification, based on weighted GMM networks (WGN), together with a new false alarm compensation procedure that reduces the false alarm rate significantly with little cost to the miss rate.

Investigating the Effects of Training Set Synthesis for Audio Segmentation of Radio Broadcast

The proposed synthesis technique outperforms real-world data in some cases and serves as a promising alternative; the study also shows that the minimum level of audio ducking preferred by the machine learning algorithm was similar to that preferred by human listeners.

Audio segmentation of broadcast news in the Albayzin-2010 evaluation: overview, results, and discussion

The evaluation of broadcast news audio segmentation systems carried out in the context of the Albayzín-2010 evaluation campaign is presented, with the aim of gaining insight into the proposed solutions and identifying promising directions.

End-to-end learning for music audio

  • S. Dieleman, B. Schrauwen
  • Computer Science
    2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2014
Although convolutional neural networks do not outperform a spectrogram-based approach, the networks are able to autonomously discover frequency decompositions from raw audio, as well as phase- and translation-invariant feature representations.

Raw Waveform-based Audio Classification Using Sample-level CNN Architectures

Two types of sample-level deep convolutional neural networks that take raw waveforms as input and use filters with small granularity reach state-of-the-art performance levels for the three different categories of sound.

Multiclass audio segmentation based on recurrent neural networks for broadcast domain data

This paper presents a new approach based on recurrent neural networks to the multiclass audio segmentation task, whose goal is to classify an audio signal as speech, music, noise, or a combination of these. The results show that removing redundant temporal information is beneficial for the segmentation system, yielding a relative improvement of close to 5%.

Temporal Convolutional Networks for Speech and Music Detection in Radio Broadcast

The study shows that Temporal Convolutional Network (TCN) architectures can outperform state-of-the-art architectures, and the novel non-causal TCN extension introduced in this paper leads to a significant improvement in accuracy.

Audio analysis for surveillance applications

The proposed hybrid solution detects new kinds of suspicious audio events that occur as outliers against a background of usual activity; it adaptively learns a Gaussian mixture model of the background sounds and updates the model incrementally as new audio data arrives.