Initialization Methods for an EMG-based Silent Speech Recognizer

Abstract

The application of surface electromyography (EMG) to automatic speech recognition is a relatively new research field that has developed rapidly in recent years. Previous work in this area was usually limited to distinguishing whole utterances, but the first systems that recognize continuous speech from EMG signals have recently been developed. Recognizing continuous speech requires a phoneme-based recognizer, whose initialization needs exact time-alignments of the training data. These alignments can be generated from audio signals that are recorded in parallel with a conventional microphone and then processed by a conventional speech recognizer. The main application of surface EMG in speech recognition is the recognition of silent speech. In this situation, audio-generated time-alignments are not readily available, so another way to initialize a silent speech EMG recognizer is needed. Simply using a recognizer trained on EMG signals of audible speech to recognize silent speech is not the best option, because articulation differs between audible and silent speech. This work deals with initializing an EMG-based recognizer for silent speech. I compare different methods to achieve this goal, including the manual generation of time-alignments, and evaluate them by the results of the final silent speech recognition step on a large corpus of EMG recordings of silent speech. I find that a recognizer can best be initialized by "Cross-Modal Labeling", which involves computing time-alignments for the EMG recordings of silent speech and then training a full EMG recognizer on the silent speech recordings. Compared to the baseline method of training a recognizer on audible EMG and testing it on silent EMG ("Cross-Modal Testing"), which gives a word error rate (WER) of 91.0%, Cross-Modal Labeling yields a WER of 77.5%, a significant relative improvement of 14.8%.
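The relative improvement quoted above follows from the two WER figures by simple arithmetic. The following is a minimal sketch of that calculation; the function name `relative_improvement` and the variable names are illustrative assumptions, not taken from the thesis.

```python
def relative_improvement(baseline_wer: float, new_wer: float) -> float:
    """Return the relative WER reduction in percent.

    Hypothetical helper for illustration: both arguments are word error
    rates in percent, and the result is (baseline - new) / baseline * 100.
    """
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# WER figures as stated in the abstract (in percent).
cross_modal_testing_wer = 91.0   # baseline: train on audible EMG, test on silent EMG
cross_modal_labeling_wer = 77.5  # recognizer trained on silent EMG

gain = relative_improvement(cross_modal_testing_wer, cross_modal_labeling_wer)
print(f"relative improvement: {gain:.1f}%")  # prints "relative improvement: 14.8%"
```

The absolute reduction is 13.5 percentage points; dividing by the baseline of 91.0% gives the relative figure of 14.8%.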
Moreover, optimizing this process by iterating the computation of time-alignments gives a word error rate of 71.01%, a relative improvement of 8.42% over the original Cross-Modal Labeling approach. These are the best results obtained so far on the silent speech part of the EMG-PIT corpus.

Acknowledgements

First, I would like to thank Mr. Michael Wand, who advised me throughout this work. My thanks also go to Prof. Tanja Schultz and to all the teachers, employees and students at the Cognitive Systems Lab. I would also like to thank my parents and my sister; they have …
