Initialization Methods for an EMG-based Silent Speech Recognizer

Abstract

The application of surface electromyography (EMG) to automatic speech recognition is a relatively new research field that has developed rapidly in recent years. Previous work in this area was usually limited to distinguishing whole utterances, but the first systems to recognize continuous speech from EMG signals have recently been developed. To recognize continuous speech, one uses a phoneme-based recognizer; its initialization requires exact time-alignments of the training data, which can be generated from audio signals that are recorded in parallel with a conventional microphone and then processed by a conventional speech recognizer. The main application of surface EMG in speech recognition is the recognition of silent speech. In this situation, audio-generated time-alignments are not readily available, so another way must be found to initialize a silent speech EMG recognizer. Most notably, due to differences in articulation between audible speech and silent speech, simply using a recognizer trained on EMG signals of audible speech to recognize silent speech is not the best option. This work deals with initializing an EMG-based recognizer for silent speech. I compare different methods to achieve this goal, including the manual generation of time-alignments, and evaluate their performance based on the results of the final silent speech recognition step on a large corpus of EMG recordings of silent speech. I find that a recognizer can best be initialized by "Cross-Modal Labeling", which involves computing time-alignments for the EMG recordings of silent speech and then training a full EMG recognizer on the silent speech recordings. Compared to the baseline method of training a recognizer on audible EMG and testing it on silent EMG ("Cross-Modal Testing"), which gives a word error rate (WER) of 91.0%, Cross-Modal Labeling yields a WER of 77.5%, which is a significant relative improvement of 14.8%. Moreover, an optimization of this process that applies an iterated computation of time-alignments gives a WER of 71.01%, which is a relative improvement of 8.42% over the original Cross-Modal Labeling approach. These are the best results obtained so far on the silent speech part of the EMG-PIT corpus.
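The relative improvements quoted above follow from simple arithmetic on the WER figures. The sketch below is illustrative only and is not taken from the paper: it assumes the standard definition of WER as the word-level edit distance divided by the number of reference words, and the function names `word_error_rate` and `relative_improvement` are hypothetical. Note that recomputing the second improvement from the rounded value 77.5% gives roughly 8.4% rather than exactly 8.42%, presumably because the 77.5% reported in the abstract is itself rounded.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Standard WER: word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def relative_improvement(baseline_wer: float, new_wer: float) -> float:
    """Fraction by which new_wer reduces baseline_wer."""
    return (baseline_wer - new_wer) / baseline_wer


# WER figures as stated in the abstract (in percent).
cross_modal_testing = 91.0
cross_modal_labeling = 77.5
iterated_labeling = 71.01

print(f"{100 * relative_improvement(cross_modal_testing, cross_modal_labeling):.1f}%")
print(f"{100 * relative_improvement(cross_modal_labeling, iterated_labeling):.1f}%")
```

Relative improvement is measured against the baseline error rate, which is why a drop of 13.5 percentage points from 91.0% corresponds to 14.8% rather than 13.5%.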

Cite this paper

@inproceedings{Schultz2010InitializationMF, title={Initialization Methods for an EMG-based Silent Speech Recognizer}, author={Tanja Schultz}, year={2010} }