Automatic speech data clustering with human perception based weighted distance

Abstract

Speech data from the internet contain different speaking styles relating to information genre, emotions, sentiments, speaker characters, etc. Automatic classification of such data remains a challenging problem because it is difficult to define categories that clearly characterize different speaking styles. To address this problem, this paper proposes a method based on x-means clustering, an extension of k-means that does not require a fixed number of classes. The x-means method clusters the data according to a pre-defined distance measure computed over different features. Current distance measures focus only on the features themselves and ignore the impact of these features on human perception. To derive a more reasonable distance measure, this paper also proposes a human perception based weighted distance that captures the contribution of different acoustic features to human perception. In this way, the automatic classification of internet speech data makes use of prior knowledge of human perception while also capturing the speaking style characteristics of different datasets with varying categories. Listening test experiments demonstrate that it is useful to introduce human perception prior knowledge into the distance measure, and that the proposed method outperforms the baseline with conventional Euclidean distance, with a 10% improvement in classification accuracy.
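The abstract does not give the form of the weighted distance, so the following is only a minimal sketch under the assumption that it is a per-feature weighted Euclidean distance. The function names, the three-feature vectors, and the weight values are hypothetical and are not taken from the paper; in the paper the weights would come from human perception data.

```python
import math


def weighted_euclidean(x, y, w):
    """Return sqrt(sum_i w_i * (x_i - y_i)^2) for equal-length sequences."""
    if not (len(x) == len(y) == len(w)):
        raise ValueError("x, y and w must have the same length")
    if any(wi < 0 for wi in w):
        raise ValueError("weights must be non-negative")
    return math.sqrt(sum(wi * (xi - yi) ** 2 for xi, yi, wi in zip(x, y, w)))


def scale_features(x, w):
    """Scale each feature by sqrt(w_i).

    Plain Euclidean distance between scaled vectors equals the weighted
    distance between the originals, so an unmodified k-means or x-means
    implementation can be run on the scaled features.
    """
    return [math.sqrt(wi) * xi for xi, wi in zip(x, w)]


# Hypothetical perceptual weights for three acoustic features
# (assumed values for illustration only).
weights = [4.0, 1.0, 0.0]
a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 10.0]

d_weighted = weighted_euclidean(a, b, weights)
d_plain = weighted_euclidean(a, b, [1.0, 1.0, 1.0])
d_scaled = math.dist(scale_features(a, weights), scale_features(b, weights))
```

With these assumed weights, the third feature has no influence on the distance and the first feature counts four times as much as the second, which is the kind of reweighting a perception-based distance is meant to provide relative to the unweighted Euclidean baseline.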

DOI: 10.1109/ISCSLP.2014.6936604

Cite this paper

@article{Wu2014AutomaticSD,
  title={Automatic speech data clustering with human perception based weighted distance},
  author={Xixin Wu and Zhiyong Wu and Jia Jia and Helen M. Meng and Lianhong Cai and Weifeng Li},
  journal={The 9th International Symposium on Chinese Spoken Language Processing},
  year={2014},
  pages={216-220}
}