Structured Models for Semantic Analysis of Audio Content


In the universe of audio signals, notions such as syntax, semantics, and pragmatics have been associated with only a few domains: speech and language and, to some extent, musical analysis. Research efforts to formalize general notions of syntactic or semantic structure for universal audio analysis have been relatively limited. Prior work on the analysis of audio content has largely involved identifying certain sounds in recordings, typically within a shallow analysis framework that assumes observed acoustics map directly to semantics. We posit that sound in fact possesses a hierarchical semantic structure, and that a full understanding of the semantic content of recordings requires inferring this structure. However, modeling this kind of structure in supervised settings would require richly annotated datasets that do not currently exist and that would take significant annotation effort to develop. The main hypothesis that drives this dissertation is that sound has its own language and structure, and that the deeper, underlying semantics can be modeled using a hierarchical framework. In this dissertation, we present such a hierarchical framework and develop formal models for it that are designed for unsupervised or weakly supervised settings. We model the observed sound using sequences of lower-level units. While these units may not carry semantic information individually, their sequences or distributions should. In this language for sounds, the lower-level units are analogous to the alphabet. Representing sound as a discrete sequence lends itself naturally to a hierarchical structure, in which sequences of lower-level units are mapped at higher levels to real events with clear semantic interpretations. Further, these event sequences should carry information about the overall semantic category of the audio.
Depending on the restrictions we enforce at various levels of this structure, such structured models can be used to classify audio, detect sound events, segment files, or predict associated sound classes. In this dissertation, we present structured models for the various layers in the hierarchy. We then explore two different paradigms for inducing a hierarchy over the low-level acoustic units. Our proposed methods are unsupervised and task-agnostic, and we demonstrate empirically, using standard audio tasks, that semantic analysis of audio using this framework is feasible and that it outperforms other plausible semantically motivated schemes. Finally, we discuss some directions for future work and present preliminary formulations and experiments toward addressing them. The research pursued in this dissertation demonstrates that hidden semantic structure can be automatically discovered from weakly labeled audio data. Further, we believe that the use of such semantically informed features will enable significant improvements over the state of the art for a number of different tasks.
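
As an illustration of the lowest layer only, the following minimal sketch shows one way a recording could be turned into a discrete sequence of units and then summarized by the distribution of those units. It is not the dissertation's model: the nearest-centroid quantizer, the fixed `codebook`, and the function names `quantize` and `ngram_histogram` are all assumptions made for this example.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def quantize(frames: Sequence[Sequence[float]],
             codebook: Sequence[Sequence[float]]) -> List[int]:
    """Map each feature frame to the index of its nearest codebook entry.

    The codebook stands in for a learned inventory of low-level units
    (an assumption; the abstract does not say how units are obtained).
    """
    def sqdist(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(codebook)), key=lambda k: sqdist(frame, codebook[k]))
        for frame in frames
    ]


def ngram_histogram(units: Sequence[int], n: int) -> Dict[Tuple[int, ...], float]:
    """Return the relative frequency of each length-n run of units."""
    grams = [tuple(units[i:i + n]) for i in range(len(units) - n + 1)]
    counts = Counter(grams)
    total = sum(counts.values())
    return {gram: count / total for gram, count in counts.items()} if total else {}


# Toy data: four 2-dimensional "frames" and a two-entry codebook (hypothetical).
frames = [[0.1, 0.0], [0.9, 1.0], [1.1, 0.9], [0.0, 0.2]]
codebook = [[0.0, 0.0], [1.0, 1.0]]

units = quantize(frames, codebook)        # discrete unit sequence
unigrams = ngram_histogram(units, 1)      # distribution of single units
bigrams = ngram_histogram(units, 2)       # distribution of adjacent unit pairs
```

Histograms of this kind are one simple way in which the distribution of units, rather than any single unit, can serve as a feature for a downstream task such as classification; the higher layers described above would operate on the unit sequence itself.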

Cite this paper

@inproceedings{Chaudhuri2013StructuredMF, title={Structured Models for Semantic Analysis of Audio Content}, author={Sourish Chaudhuri and Rita Singh and Carlos de Juan Carbonell and Dan Ellis}, year={2013} }