You Only Hear Once: A YOLO-like Algorithm for Audio Segmentation and Sound Event Detection

@article{Venkatesh2022YouOH,
  title={You Only Hear Once: A YOLO-like Algorithm for Audio Segmentation and Sound Event Detection},
  author={S. Venkatesh and Dave Moffat and Eduardo Reck Miranda},
  journal={ArXiv},
  year={2022},
  volume={abs/2109.00962}
}
Audio segmentation and sound event detection are crucial topics in machine listening that aim to detect acoustic classes and their respective boundaries. They are useful for audio-content analysis, speech recognition, audio indexing, and music information retrieval. In recent years, most research articles have adopted segmentation-by-classification, a technique that divides audio into small frames and classifies each frame individually. In this paper, we present a novel approach called You Only Hear Once (YOHO)…
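To make the contrast concrete, here is a minimal numpy sketch of the two output conventions; the two-class setup, cell duration, and presence threshold are illustrative assumptions, not values from the paper.

```python
import numpy as np

# Segmentation-by-classification: one class decision per small frame;
# boundaries only emerge implicitly where consecutive frame labels change.
frame_probs = np.array([[0.9, 0.1], [0.8, 0.2], [0.2, 0.8]])  # (frames, classes)
frame_labels = frame_probs.argmax(axis=1)

# YOLO-like output: each output cell predicts, per class,
# [presence, relative start, relative end], so boundaries are regressed directly.
cell_dur = 0.5  # seconds per output cell (assumed value)
yoho_out = np.array([
    [[1.0, 0.0, 1.0], [0.0, 0.0, 0.0]],  # cell 0: class 0 spans the whole cell
    [[1.0, 0.0, 0.4], [0.0, 0.0, 0.0]],  # cell 1: class 0 ends 40% into the cell
])

events = []
for i, cell in enumerate(yoho_out):
    for cls, (p, rel_start, rel_end) in enumerate(cell):
        if p >= 0.5:  # presence threshold (assumed)
            events.append((cls, (i + rel_start) * cell_dur, (i + rel_end) * cell_dur))
print(events)  # [(0, 0.0, 0.5), (0, 0.5, 0.7)]; adjacent spans merged downstream
```

Regressing boundaries over a coarse grid leaves far fewer output values to post-process than one label per frame, which is the intuition behind the lower inference times reported for YOHO.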

Citations

Evaluating robustness of You Only Hear Once(YOHO) Algorithm on noisy audios in the VOICe Dataset

TLDR
The You Only Hear Once (YOHO) algorithm can match the performance of various state-of-the-art algorithms on the Music Speech Detection, TUT Sound Event, and Urban-SED datasets, but at lower inference times.

Computational bioacoustics with deep learning: a review and roadmap

TLDR
This paper offers a subjective but principled roadmap for computational bioacoustics with deep learning: topics the community should address in order to make the most of future developments in AI and informatics, and to use audio data in answering zoological and ecological questions.

Extending Radio Broadcasting Semantics through Adaptive Audio Segmentation Automations

TLDR
The present paper focuses on adaptive audio detection, segmentation, and classification techniques for audio broadcasting content, dedicated mainly to voice data. It contributes to the formulation of a dynamic Generic Audio Classification Repository to be subjected to adaptive multilabel experimentation with more sophisticated techniques, such as deep architectures.

References

SHOWING 1-10 OF 58 REFERENCES

Artificially Synthesising Data for Audio Classification and Segmentation to Improve Speech and Music Detection in Radio Broadcast

  • S. Venkatesh, D. Moffat, E. Miranda
  • Computer Science
    ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2021
TLDR
The data synthesis procedure is demonstrated to be a highly effective technique for generating large datasets to train deep neural networks for audio segmentation, and it outperformed state-of-the-art algorithms for music-speech detection.
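As an illustration of the general idea (not the paper's exact recipe), a minimal synthesis routine might mix a speech clip over an attenuated music bed and emit frame-level labels; the gain, frame rate, and function below are assumptions.

```python
import numpy as np

def synthesise_example(speech, music, sr=22050, music_gain_db=-12.0):
    """Mix a speech clip over a music bed and return audio plus frame labels.

    `speech` and `music` are 1-D float arrays at sample rate `sr`; the gain
    and 10 ms frame rate are illustrative, not the paper's exact settings.
    """
    n = min(len(speech), len(music))
    gain = 10.0 ** (music_gain_db / 20.0)
    mix = speech[:n] + gain * music[:n]
    # One label vector per 10 ms frame: here both classes are active throughout.
    n_frames = n // (sr // 100)
    labels = np.ones((n_frames, 2), dtype=np.float32)  # [speech, music]
    return mix, labels

speech = np.random.randn(22050)  # stand-ins for real speech/music clips
music = np.random.randn(22050)
mix, labels = synthesise_example(speech, music)
print(mix.shape, labels.shape)   # (22050,) (100, 2)
```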

Sound Event Detection of Weakly Labelled Data With CNN-Transformer and Automatic Threshold Optimization

TLDR
A convolutional neural network transformer (CNN-Transformer) is proposed for audio tagging and SED, and it is shown that the CNN-Transformer performs similarly to a convolutional recurrent neural network (CRNN).
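A minimal PyTorch sketch of such a hybrid is given below; layer sizes and depths are illustrative, and the paper's automatic threshold optimization is omitted in favor of plain sigmoid outputs.

```python
import torch
import torch.nn as nn

class CNNTransformer(nn.Module):
    """CNN front-end + Transformer encoder for SED (sizes are illustrative)."""
    def __init__(self, n_mels=64, n_classes=10, d_model=128):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d((1, 2)),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d((1, 2)),
        )
        self.proj = nn.Linear(64 * (n_mels // 4), d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, mel):                    # mel: (batch, time, n_mels)
        x = self.cnn(mel.unsqueeze(1))         # (batch, 64, time, n_mels/4)
        x = x.permute(0, 2, 1, 3).flatten(2)   # (batch, time, 64 * n_mels/4)
        x = self.encoder(self.proj(x))
        return torch.sigmoid(self.head(x))     # frame-wise class probabilities

model = CNNTransformer()
print(model(torch.randn(2, 100, 64)).shape)    # (2, 100, 10)
```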

Advances in unsupervised audio classification and segmentation for the broadcast news and NGSW corpora

TLDR
A new algorithm is proposed for audio classification, based on weighted GMM networks (WGN), and a new false-alarm compensation procedure is implemented, which significantly reduces the false-alarm rate at little cost to the miss rate.
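For illustration, a plain per-class GMM classifier with class weights can be sketched with scikit-learn; this is a generic stand-in, not the paper's exact WGN formulation or its false-alarm compensation procedure.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Fit one GMM per acoustic class; classify a segment by weighted log-likelihood.
rng = np.random.default_rng(0)
train = {"speech": rng.normal(0, 1, (500, 13)), "music": rng.normal(2, 1, (500, 13))}
models = {c: GaussianMixture(n_components=4, random_state=0).fit(X)
          for c, X in train.items()}

segment = rng.normal(0, 1, (50, 13))           # e.g. 13-dim MFCC frames
weights = {"speech": 1.0, "music": 1.0}        # class weights (assumed uniform)
scores = {c: weights[c] * m.score(segment) for c, m in models.items()}
print(max(scores, key=scores.get))             # -> "speech"
```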

Investigating the Effects of Training Set Synthesis for Audio Segmentation of Radio Broadcast

TLDR
The proposed synthesis technique outperforms real-world data in some cases and serves as a promising alternative; the study also shows that the minimum level of audio ducking preferred by the machine learning algorithm was similar to that preferred by human listeners.
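Audio ducking itself is simple to sketch: attenuate the music bed wherever speech is active. The -15 dB level below is a placeholder, not the preference threshold the paper reports.

```python
import numpy as np

def duck_music(music, speech_mask, duck_db=-15.0):
    """Attenuate the music bed wherever speech is active ("ducking").

    `speech_mask` is a per-sample boolean array; the duck level here is
    an assumed placeholder value.
    """
    gain = np.where(speech_mask, 10.0 ** (duck_db / 20.0), 1.0)
    return music * gain

music = np.random.randn(22050)
mask = np.zeros(22050, dtype=bool)
mask[5000:15000] = True                 # speech active in the middle
print(duck_music(music, mask).std())    # overall level drops under speech
```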

Audio segmentation of broadcast news in the Albayzin-2010 evaluation: overview, results, and discussion

TLDR
The evaluation of broadcast-news audio segmentation systems carried out in the Albayzín-2010 evaluation campaign is presented, with the aim of gaining insight into the proposed solutions and identifying promising research directions.

End-to-end learning for music audio

  • S. Dieleman, B. Schrauwen
  • Computer Science
    2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2014
TLDR
Although the convolutional neural networks do not outperform a spectrogram-based approach, the networks are able to autonomously discover frequency decompositions from raw audio, as well as phase- and translation-invariant feature representations.
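The idea can be sketched as a strided 1-D convolution front-end applied directly to the waveform, replacing the fixed spectrogram transform with learned filters; filter lengths and strides here are illustrative.

```python
import torch
import torch.nn as nn

# A strided 1-D convolution stands in for the fixed spectrogram front-end,
# so the first layer can learn its own frequency decomposition.
frontend = nn.Sequential(
    nn.Conv1d(1, 32, kernel_size=256, stride=256),  # frame-sized learned filters
    nn.ReLU(),
    nn.Conv1d(32, 32, kernel_size=8, padding=4),
    nn.ReLU(),
    nn.MaxPool1d(4),
)

wav = torch.randn(1, 1, 22050)           # one second of raw audio at 22.05 kHz
print(frontend(wav).shape)               # (1, 32, 21) feature map over time
```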

Raw Waveform-based Audio Classification Using Sample-level CNN Architectures

TLDR
Two types of sample-level deep convolutional neural networks that take raw waveforms as input and use filters with small granularity reach state-of-the-art performance levels for the three different categories of sound.
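A minimal sample-level stack, assuming VGG-style blocks with 3-sample filters (depth and channel widths are illustrative):

```python
import torch
import torch.nn as nn

def sample_block(cin, cout):
    """Conv with 3-sample filters + pool: the "small granularity" building block."""
    return nn.Sequential(nn.Conv1d(cin, cout, kernel_size=3, padding=1),
                         nn.BatchNorm1d(cout), nn.ReLU(), nn.MaxPool1d(3))

# Stacking many such blocks grows the receptive field sample by sample,
# analogous to 3x3 filters in VGG-style image networks.
net = nn.Sequential(sample_block(1, 32), sample_block(32, 32),
                    sample_block(32, 64), sample_block(64, 64))

wav = torch.randn(1, 1, 16038)           # raw waveform, length divisible by 3**4
print(net(wav).shape)                    # (1, 64, 198)
```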

Multiclass audio segmentation based on recurrent neural networks for broadcast domain data

TLDR
This paper presents a new approach to multiclass audio segmentation based on recurrent neural networks, whose goal is to classify an audio signal as speech, music, noise, or a combination of these; it shows that removing redundant temporal information benefits the segmentation system, yielding a relative improvement close to 5%.
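A hedged PyTorch sketch along these lines: a bidirectional GRU over pooled features, with stride-2 pooling standing in for the removal of redundant temporal information, and sigmoid outputs so classes can co-occur. All sizes are assumptions, not the paper's.

```python
import torch
import torch.nn as nn

class RNNSegmenter(nn.Module):
    """GRU over pooled features; speech/music/noise as independent labels."""
    def __init__(self, n_feats=64, n_classes=3):
        super().__init__()
        self.pool = nn.AvgPool1d(kernel_size=2, stride=2)   # halve the frame rate
        self.gru = nn.GRU(n_feats, 128, num_layers=2, batch_first=True,
                          bidirectional=True)
        self.head = nn.Linear(256, n_classes)

    def forward(self, feats):                      # (batch, time, n_feats)
        x = self.pool(feats.transpose(1, 2)).transpose(1, 2)
        out, _ = self.gru(x)
        return torch.sigmoid(self.head(out))       # sigmoid: classes may co-occur

model = RNNSegmenter()
print(model(torch.randn(2, 200, 64)).shape)        # (2, 100, 3)
```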

Temporal Convolutional Networks for Speech and Music Detection in Radio Broadcast

TLDR
The study shows that Temporal Convolutional Network (TCN) architectures can outperform state-of-the-art architectures, and that the novel non-causal TCN extension introduced in the paper leads to a significant improvement in accuracy.
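The non-causal ingredient is symmetric (rather than left-only) padding in the dilated convolutions, so each output frame sees future as well as past context. A minimal residual block along those lines, with illustrative sizes, might look like this:

```python
import torch
import torch.nn as nn

class NonCausalTCNBlock(nn.Module):
    """Dilated 1-D conv block with symmetric padding: each output frame
    attends to both past and future context (the non-causal case)."""
    def __init__(self, channels=64, kernel_size=3, dilation=2):
        super().__init__()
        pad = (kernel_size - 1) // 2 * dilation       # symmetric, not left-only
        self.conv = nn.Conv1d(channels, channels, kernel_size,
                              padding=pad, dilation=dilation)
        self.relu = nn.ReLU()

    def forward(self, x):                             # (batch, channels, time)
        return self.relu(self.conv(x)) + x            # residual connection

block = NonCausalTCNBlock()
print(block(torch.randn(1, 64, 100)).shape)           # (1, 64, 100)
```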

Audio analysis for surveillance applications

TLDR
The proposed hybrid solution is capable of detecting new kinds of suspicious audio events that occur as outliers against a background of usual activity; it adaptively learns a Gaussian mixture model of the background sounds and updates the model incrementally as new audio data arrives.
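A batch approximation of the idea with scikit-learn (the paper's incremental model update is omitted): fit a GMM on background frames and flag low-likelihood frames as outliers. The percentile cutoff is an assumed choice.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Model "usual" background sound with a GMM; frames with low likelihood
# under the background model are flagged as suspicious events.
rng = np.random.default_rng(0)
background = rng.normal(0, 1, (2000, 13))      # e.g. MFCCs of normal activity
bg_model = GaussianMixture(n_components=8, random_state=0).fit(background)

threshold = np.percentile(bg_model.score_samples(background), 1)  # assumed cutoff
new_frames = np.vstack([rng.normal(0, 1, (95, 13)),
                        rng.normal(6, 1, (5, 13))])  # 5 outlier frames
suspicious = bg_model.score_samples(new_frames) < threshold
print(int(suspicious.sum()), "suspicious frames")
```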
...