Highly Accurate Mandarin Tone Classification In The Absence of Pitch Information

Abstract

A deep neural network (DNN) classifier based only on 40 mel-frequency cepstral coefficients (MFCCs) achieved 29.99% frame error rate (FER) and 16.86% segment error rate (SER) in recognizing five tonal categories in Mandarin Chinese broadcast news. With the addition of subband autocorrelation change detection (SACD) pitch-class features [1], the classifier scored 27.58% FER and 15.56% SER. These results are substantially better than the best previously reported results on broadcast news tone classification [2] and are also better than a human listener achieved in categorizing test stimuli created by amplitude- and frequency-modulating complex tones to match the extracted F0 and amplitude parameters [3]. The same DNN architecture scored substantially worse when trained and tested with SACD pitch-class parameters alone: 39.22% FER and 24.89% SER. Results with RAPT F0 estimates are worse yet: 44.37% FER and 27.28% SER. The 40 MFCC parameters do not encode F0 in any obvious way, and attempts to predict SACD or other pitch features from them perform poorly. These surprising results raise difficult questions for theories of Chinese tone.
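To make the size of the gaps between feature sets easier to see, the following minimal sketch computes relative error reductions from the four FER/SER pairs reported above. It is only an illustration of the arithmetic: the dictionary `RESULTS` and the helpers `relative_reduction` and `compare` are hypothetical names introduced here, not part of the paper, and the paper itself does not report these derived figures.

```python
# Error rates in percent, copied from the abstract.
# The dictionary layout and key names are illustrative assumptions.
RESULTS = {
    "MFCC": {"FER": 29.99, "SER": 16.86},
    "MFCC+SACD": {"FER": 27.58, "SER": 15.56},
    "SACD": {"FER": 39.22, "SER": 24.89},
    "RAPT F0": {"FER": 44.37, "SER": 27.28},
}


def relative_reduction(baseline: float, improved: float) -> float:
    """Return the fraction of the baseline error removed by the improved system."""
    return (baseline - improved) / baseline


def compare(baseline_name: str, improved_name: str) -> dict:
    """Relative FER and SER reduction of one feature set over another."""
    base = RESULTS[baseline_name]
    imp = RESULTS[improved_name]
    return {metric: relative_reduction(base[metric], imp[metric]) for metric in base}


# Compare each pitch-based or combined system against the MFCC-only system.
for baseline, improved in [
    ("MFCC", "MFCC+SACD"),  # gain from adding pitch-class features to MFCCs
    ("SACD", "MFCC"),       # MFCCs alone versus SACD pitch-class features alone
    ("RAPT F0", "MFCC"),    # MFCCs alone versus RAPT F0 estimates alone
]:
    gains = compare(baseline, improved)
    print(
        f"{improved} vs {baseline}: "
        f"FER reduced by {gains['FER']:.1%}, SER reduced by {gains['SER']:.1%}"
    )
```

Read this way, adding SACD features to the MFCCs removes a relatively small share of the remaining error, while replacing pitch-only features with MFCCs removes a much larger share, which is the contrast the abstract emphasizes.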

Cite this paper

@inproceedings{Ryant2013HighlyAM, title={Highly Accurate Mandarin Tone Classification In The Absence of Pitch Information}, author={Neville Ryant and Malcolm Slaney and Mark Liberman and Elizabeth Shriberg and Jiahong Yuan}, year={2013} }