Speech Decoding as Machine Translation

Joseph G. Makin, David A. Moses, Edward F. Chang
SpringerBriefs in Electrical and Computer Engineering



Machine translation of cortical activity to text with an encoder–decoder framework

This work shows how to decode the electrocorticogram with high accuracy and at natural-speech rates, and how decoding with limited data can be improved through transfer learning, by training certain layers of the network on data from multiple participants.

Real-time decoding of question-and-answer speech dialogue using human cortical activity

It is demonstrated that the context of a verbal exchange can be used to enhance neural decoder performance in real time: contextual integration of decoded question likelihoods significantly improves answer decoding.

Speech synthesis from neural decoding of spoken sentences

A neural decoder is described that explicitly leverages kinematic and sound representations encoded in human cortical activity to synthesize audible speech. The synthesized speech was readily identified and transcribed by listeners, and the decoder could also synthesize speech when a participant silently mimed sentences.

Speech synthesis from ECoG using densely connected 3D convolutional neural networks

This is the first time that high-quality speech has been reconstructed from neural recordings during speech production using deep neural networks. The approach uses a densely connected convolutional neural network topology, which is well suited to the small amount of data available from each participant.

Multi-task Learning

Intuitively, it would seem that learning related tasks jointly would help to uncover common knowledge and improve generalization performance, and this intuition is supported by empirical evidence provided by recent developments in transfer learning and multi-task learning.

Decoding Speech from Intracortical Multielectrode Arrays in Dorsal “Arm/Hand Areas” of Human Motor Cortex

Neural activity was recorded from two 96-electrode arrays chronically implanted into the ‘hand knob’ area of motor cortex while a person with tetraplegia spoke. The results suggest that high-fidelity speech prostheses may be possible using large-scale intracortical recordings in motor cortical areas involved in controlling speech articulators.

Differential Representation of Articulatory Gestures and Phonemes in Precentral and Inferior Frontal Gyri

This work investigates the cortical representation of articulatory gestures and phonemes in ventral precentral and inferior frontal gyri in men and women and suggests that speech production shares a common cortical representation with that of other types of movement, such as arm and hand movements.

Toward Human Parity in Conversational Speech Recognition

A human error rate for commercial bulk transcription is measured on the widely used NIST 2000 test set, suggesting that, given sufficient matched training data, conversational speech transcription engines are approximating human parity in both quantitative and qualitative terms.

Convolutional Sequence to Sequence Learning

This work introduces an architecture based entirely on convolutional neural networks, which outperforms the deep LSTM setup of Wu et al. (2016) in accuracy on both WMT'14 English-German and WMT'14 English-French translation while running an order of magnitude faster, on both GPU and CPU.

Attention Is All You Need

A new, simple network architecture, the Transformer, is proposed; it is based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. It generalizes well to other tasks, as shown by applying it successfully to English constituency parsing with both large and limited training data.