Use of Voicing and Pitch Information for Speaker Recognition

Abstract

A speech signal can be decomposed into two parts: the source part and the system part. The system part corresponds to the smooth envelope of the power spectrum and is used in the form of cepstral coefficients in almost all automatic speaker recognition systems reported in the literature. The source part contains information about voicing and pitch. Though this information is very important for human beings to identify a person from his/her voice, it is rarely used for automatic speaker recognition. In this paper, we propose a simple and reliable method to derive acoustic features based on voicing and pitch information and use them for automatic speaker recognition. We evaluate these features for speaker identification using the TIMIT, NTIMIT and IISC databases and demonstrate their effectiveness.

INTRODUCTION

A speech signal can be decomposed into two parts: the source part and the system part. The system part consists of the smooth envelope of the power spectrum and is represented in the form of cepstral coefficients, which can be computed by using either linear prediction analysis or mel filter-bank analysis. Most of the automatic speaker recognition systems reported in the literature utilise the system information in the form of cepstral coefficients. These systems perform reasonably well. The source information has rarely been used in the past for speaker recognition systems. The source contains information about pitch and voicing. This information is very important for humans to identify a person from his/her voice. A few studies have been reported where pitch information is used as a feature for speaker recognition. However, the results are not very encouraging. The main reason is that pitch estimation methods are not very reliable: they introduce errors that degrade the performance of the speaker recognition system.

In this paper we propose a simple method for extracting the voicing and pitch information from the speech signal in a reliable manner. This is done by uniformly dividing the higher portion of the autocorrelation function into a number of parts and computing the maximum autocorrelation value in each of these parts. These maximum autocorrelation values (MACVs) are used as features for speaker recognition. We evaluate these MACV features on the TIMIT, NTIMIT and IISC databases for the speaker identification task. In order to put these features in proper perspective, we compare their speaker identification performance with that of the pitch feature.

COMPUTATION OF PITCH FEATURE

As mentioned earlier, we compare the speaker identification performance of the MACV features with the pitch feature. For determining the pitch value, we use two different methods: 1) the autocorrelation method and 2) the average magnitude difference function (AMDF) method. Consider a speech frame {s(n), n = 0, 1, ..., N − 1}. In the autocorrelation method, the autocorrelation function of the speech signal {s(n)} is computed as follows:
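The equation itself does not appear in this excerpt, so the sketch below assumes the standard short-time autocorrelation R(k) = Σ s(n) s(n+k), summed over n = 0, ..., N − 1 − k, and one common form of the AMDF, D(k) = (1/(N − k)) Σ |s(n) − s(n+k)|. It illustrates the two pitch estimators and the MACV idea described above; it is not the authors' implementation. The function names, the normalisation by R(0), the lag range of 16 to 128 samples (2 to 16 ms at 8 kHz), and the choice of five parts are all assumptions made for illustration.

```python
import math


def autocorrelation(frame, max_lag):
    """Assumed definition: R(k) = sum_{n=0}^{N-1-k} s(n) s(n+k), k = 0..max_lag."""
    n = len(frame)
    if max_lag >= n:
        raise ValueError("max_lag must be smaller than the frame length")
    return [sum(frame[i] * frame[i + k] for i in range(n - k))
            for k in range(max_lag + 1)]


def amdf(frame, max_lag):
    """Assumed definition: D(k) = mean of |s(n) - s(n+k)| over the N-k overlapping samples."""
    n = len(frame)
    if max_lag >= n:
        raise ValueError("max_lag must be smaller than the frame length")
    return [sum(abs(frame[i] - frame[i + k]) for i in range(n - k)) / (n - k)
            for k in range(max_lag + 1)]


def pitch_autocorrelation(frame, sample_rate, min_lag, max_lag):
    """Pitch in Hz from the lag in [min_lag, max_lag] that maximises R(k)."""
    r = autocorrelation(frame, max_lag)
    lag = max(range(min_lag, max_lag + 1), key=lambda k: r[k])
    return sample_rate / lag


def pitch_amdf(frame, sample_rate, min_lag, max_lag):
    """Pitch in Hz from the lag in [min_lag, max_lag] that minimises D(k)."""
    d = amdf(frame, max_lag)
    lag = min(range(min_lag, max_lag + 1), key=lambda k: d[k])
    return sample_rate / lag


def macv_features(frame, min_lag, max_lag, num_parts):
    """Split lags [min_lag, max_lag] into num_parts near-equal parts and
    return the maximum of R(k)/R(0) in each part (normalisation is assumed)."""
    span = max_lag - min_lag + 1
    if num_parts < 1 or span < num_parts:
        raise ValueError("lag range too short for the requested number of parts")
    r = autocorrelation(frame, max_lag)
    if r[0] == 0:
        return [0.0] * num_parts  # silent frame: no voicing evidence
    features = []
    for p in range(num_parts):
        start = min_lag + p * span // num_parts
        end = min_lag + (p + 1) * span // num_parts  # exclusive bound
        features.append(max(r[k] / r[0] for k in range(start, end)))
    return features


# Illustrative input: a 30 ms frame of a 100 Hz sine sampled at 8 kHz.
sample_rate = 8000
frame = [math.sin(2 * math.pi * 100 * n / sample_rate) for n in range(240)]
min_lag, max_lag = 16, 128

pitch_acf = pitch_autocorrelation(frame, sample_rate, min_lag, max_lag)
pitch_amd = pitch_amdf(frame, sample_rate, min_lag, max_lag)
macv = macv_features(frame, min_lag, max_lag, num_parts=5)
print(pitch_acf, pitch_amd, macv)
```

Both pitch estimators commit to a single lag, so one wrong peak gives a wrong pitch value. The MACV vector instead keeps one peak height per lag region, which is one way to read the claim that the features capture voicing and pitch information without relying on an explicit pitch decision.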

Cite this paper

@inproceedings{Wildermoth2000UseOV, title={Use of Voicing and Pitch Information for Speaker Recognition}, author={Brett R. Wildermoth and Kuldip K. Paliwal}, year={2000} }