Detection and Classification of Acoustic Scenes and Events

Abstract

Over the last decade, there has been increased interest in the speech and audio processing community in code dissemination and public evaluation of proposed methods. Public evaluation can serve as a reference point for the performance of proposed methods and can also be used to study performance improvements over the years. For example, source separation and automatic music transcription are well-defined tasks with established performance metrics, and public evaluations are performed for each: the SiSEC evaluation for signal separation [3] for the former and the MIREX competition for music information retrieval [2] for the latter. However, for researchers working in the field of computational auditory scene analysis, and specifically on the tasks of modeling and identifying acoustic scenes containing non-speech and non-music content and of detecting audio events, no coordinated, established international challenge yet exists. We therefore propose to organise a challenge on the performance evaluation of systems for the detection and classification of acoustic events. This challenge will help the research community take a step forward in better defining the task and will also provide an incentive for researchers to actively pursue research in this field. Finally, it will help shed light on controversies that currently exist around the task and offer a reference point for systems developed to perform parts of it. We should mention that, at present, the closest challenge to the one we propose is TRECVID Multimedia Event Detection, where the focus is on audiovisual, multi-modal event detection in video recordings [4]. Some researchers use only the audio from the TRECVID challenge to evaluate their systems, but a dataset explicitly developed for audio challenges would offer a much better evaluation framework, since it would be far more varied with respect to audio.
In addition, such a dataset would be designed to address the need for a more thorough evaluation of audio analysis systems, and could potentially be used more widely and establish itself as a standard. We should also note that a public evaluation on Audio Segmentation and Speaker Diarization [5] has been proposed. That evaluation task consists of segmenting a broadcast news audio document into a few specific classes: music, speech, speech with music/noise in the background, or other. It therefore addresses a very specific task and does not overlap with the current proposal. Finally, one public evaluation related to the proposed challenge took place in 2006 and 2007 as part of the CLEAR evaluations [8], funded by the CHIL project. Several tasks on audio-only, video-only, or multimodal tracking and event detection were proposed, among them an evaluation on “Acoustic Event Detection and Classification”. The datasets were recorded during several interactive seminars and contain seminar-related events (speech, applause, chair moving, etc.). Of the datasets created for these evaluations, the “FBK-Irst database of isolated meeting-room acoustic events” [7] has been widely used in the event detection literature; however, that dataset contains only non-overlapping events. The CLEAR evaluations, although promising and innovative at the time, did not lead to the establishment of a widely accepted evaluation challenge for this type of task.


Cite this paper

@inproceedings{Giannoulis2013DetectionAC,
  title  = {Detection and Classification of Acoustic Scenes and Events},
  author = {Dimitrios Giannoulis and Emmanouil Benetos and Dan Stowell and Mathias Rossignol and Mathieu Lagrange and Mark Plumbley},
  year   = {2013}
}