Inductive biases, pretraining and fine-tuning jointly account for brain responses to speech

@article{Millet2021InductiveBP,
  title={Inductive biases, pretraining and fine-tuning jointly account for brain responses to speech},
  author={Juliette Millet and J. R. King},
  journal={ArXiv},
  year={2021},
  volume={abs/2103.01032}
}
Our ability to comprehend speech remains, to date, unrivaled by deep learning models. This feat could result from the brain’s ability to fine-tune generic sound representations for speech-specific processes. To test this hypothesis, we compare i) five types of deep neural networks to ii) human brain responses elicited by spoken sentences and recorded in 102 Dutch subjects using functional Magnetic Resonance Imaging (fMRI). Each network was either trained on an acoustic scene classification, a…
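
The comparison between network activations and human brain responses described in the abstract is typically carried out with a linear encoding model. The following is a minimal sketch of that standard recipe, not the authors' released code; all data, shapes, and names are synthetic stand-ins. It fits a cross-validated ridge regression from layer activations to voxel responses, then scores the mapping with voxel-wise correlations on held-out data.

# Minimal encoding-model sketch (synthetic stand-ins for activations and fMRI).
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_samples, n_features, n_voxels = 500, 128, 20

X = rng.standard_normal((n_samples, n_features))             # layer activations
true_w = rng.standard_normal((n_features, n_voxels))
Y = X @ true_w + rng.standard_normal((n_samples, n_voxels))  # voxel responses

X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, test_size=0.2, random_state=0)

# RidgeCV selects the penalty by cross-validation; one weight map per voxel.
model = RidgeCV(alphas=np.logspace(-2, 4, 13)).fit(X_tr, Y_tr)
Y_pred = model.predict(X_te)

# "Brain score": correlation between predicted and observed held-out responses,
# computed per voxel and then averaged.
scores = [np.corrcoef(Y_te[:, v], Y_pred[:, v])[0, 1] for v in range(n_voxels)]
print(f"mean voxel-wise correlation: {np.mean(scores):.3f}")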

Citations

Interpreting intermediate convolutional layers of CNNs trained on raw speech

Using the proposed technique, one can analyze how linguistically meaningful units in speech are encoded in different convolutional layers by linearly interpolating individual latent variables to marginal levels outside the training range.
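
A minimal sketch of this kind of probe, assuming a toy generator and a hypothetical latent grid (this is not the authors' code): fix a latent vector, push a single coordinate to marginal values beyond an assumed training range of roughly [-1, 1], and read out an intermediate convolutional layer with a forward hook.

# Probing an intermediate conv layer while one latent variable is pushed
# outside the (assumed) training range. The generator is a toy stand-in.
import torch
import torch.nn as nn

torch.manual_seed(0)

generator = nn.Sequential(
    nn.Linear(16, 64),
    nn.Unflatten(1, (4, 16)),
    nn.ConvTranspose1d(4, 8, kernel_size=4, stride=2, padding=1),
    nn.ReLU(),
    nn.ConvTranspose1d(8, 1, kernel_size=4, stride=2, padding=1),
)

activations = {}
def hook(module, inputs, output):
    activations["conv"] = output.detach()
generator[3].register_forward_hook(hook)      # first conv layer's ReLU output

z = torch.zeros(1, 16)                        # baseline latent vector
for value in [-4.0, -1.0, 0.0, 1.0, 4.0]:     # [-1, 1] assumed in-range
    z_probe = z.clone()
    z_probe[0, 0] = value                     # interpolate one latent variable
    generator(z_probe)
    print(f"z[0]={value:+.1f} -> mean |activation| = "
          f"{activations['conv'].abs().mean():.4f}")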

Neural dynamics of phoneme sequences reveal position-invariant code for content and order

The authors show that brain activity moves systematically within neural populations of auditory cortex, allowing accurate representation of a speech sound’s identity and its position in the sound sequence.

Deep Recurrent Encoder: an end-to-end network to model magnetoencephalography at scale

The Deep Recurrent Encoder (DRE) reliably predicts MEG responses to words with a three-fold improvement over classic linear methods; a simple variable-importance analysis is described to investigate the MEG representations learnt by the model and recovers the expected evoked responses to word length and word frequency.
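
The variable-importance analysis mentioned above can be approximated by permutation: shuffle one input feature at a time and measure how much prediction accuracy drops. A minimal sketch with a linear stand-in for the DRE and hypothetical feature names:

# Permutation-based variable importance (linear stand-in, synthetic data).
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n, features = 400, ["word_length", "word_frequency", "noise_1", "noise_2"]

X = rng.standard_normal((n, len(features)))
y = 1.5 * X[:, 0] - 0.8 * X[:, 1] + 0.3 * rng.standard_normal(n)  # "MEG" target

model = Ridge().fit(X, y)
baseline = np.corrcoef(y, model.predict(X))[0, 1]

for i, name in enumerate(features):
    X_perm = X.copy()
    X_perm[:, i] = rng.permutation(X_perm[:, i])   # destroy this feature only
    r = np.corrcoef(y, model.predict(X_perm))[0, 1]
    print(f"{name:>14}: importance = {baseline - r:+.3f}")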

Many but not all deep neural network audio models capture brain responses and exhibit hierarchical region correspondence

Evaluating brain-model correspondence for publicly available audio neural network models, along with in-house models trained on four different tasks, suggested the importance of task optimization for explaining brain representations and generally supports the promise of deep neural networks as models of audition.

MEG-MASC: a high-quality magneto-encephalography dataset for evaluating natural speech processing

This work time-stamps the onset and offset of each word and phoneme in the recording metadata and provides Python code to replicate several validation analyses of the MEG event-related fields, such as the temporal decoding of phonetic features and word frequency.
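
Temporal decoding of the kind used in these validation analyses fits one classifier per time sample on the sensor pattern at that moment (in practice with a tool such as MNE-Python's SlidingEstimator). A self-contained sketch on synthetic epochs, with an artificial effect injected into a late time window:

# Sliding-window decoding of a binary phonetic feature from synthetic "MEG".
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_epochs, n_sensors, n_times = 200, 30, 50
X = rng.standard_normal((n_epochs, n_sensors, n_times))
y = rng.integers(0, 2, n_epochs)            # e.g., voiced vs. unvoiced

X[y == 1, :5, 30:40] += 1.0                 # decodable signal, late window only

scores = []
for t in range(n_times):                    # one classifier per time sample
    clf = LogisticRegression(max_iter=1000)
    scores.append(cross_val_score(clf, X[:, :, t], y, cv=5).mean())

peak = int(np.argmax(scores))
print(f"peak decoding accuracy {scores[peak]:.2f} at sample {peak}")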

Toward a realistic model of speech processing in the brain with self-supervised learning

This work presents the largest neuroimaging benchmark to date, showing how self-supervised learning can account for a rich organization of speech processing in the brain, and thus delineates a path toward identifying the laws of language acquisition that shape the human brain.

Self-supervised models of audio effectively explain human cortical responses to speech

Overall, these results show that self-supervised models effectively capture the hierarchy of information relevant to different stages of speech processing in human cortex.

The Mapping of Deep Language Models on Brain Responses Primarily Depends on their Performance

Overall, this study evidences a partial convergence of language transformers to brain-like solutions, and shows how this phenomenon helps unravel the brain bases of natural language processing.

Successes and critical failures of neural networks in capturing human-like speech recognition

This work clarifies how influential speech manipulations in the literature relate to each other and to natural speech, shows the granularities at which machines exhibit out-of-distribution robustness and reproduce classical perceptual phenomena observed in humans, and demonstrates a crucial failure of all artificial systems to perceptually recover where humans do.

References

Showing 1-10 of 47 references

Performance-optimized hierarchical models predict neural responses in higher visual cortex

This work uses computational techniques to identify a high-performing neural network model that matches human performance on challenging object categorization tasks and shows that performance optimization—applied in a biologically appropriate model class—can be used to build quantitative predictive models of neural processing.

Mixed-Precision Training for NLP and Speech Recognition with OpenSeq2Seq (2018)

Searching through functional space reveals distributed visual, auditory, and semantic coding in the human brain

This work reports evidence that visual and auditory features from deep neural networks, and semantic features from a natural language processing model, are more widely distributed across the brain than previously acknowledged.

Cascaded Tuning to Amplitude Modulation for Natural Sound Recognition

This work modeled the function of the entire auditory system, that is, recognizing sounds from raw waveforms, with as few anatomical or physiological assumptions as possible, and found that auditory-system-like amplitude-modulation (AM) tuning emerges in the optimized DNN.

Deep Speech 2 : End-to-End Speech Recognition in English and Mandarin

It is shown that an end-to-end deep learning approach can be used to recognize either English or Mandarin Chinese speech, two vastly different languages, and is competitive with the transcription of human workers when benchmarked on standard datasets.

The cortical organization of speech processing

A dual-stream model of speech processing is outlined that assumes the ventral stream is largely bilaterally organized (although there are important computational differences between the left- and right-hemisphere systems) and that the dorsal stream is strongly left-hemisphere dominant.

Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks

This paper presents a novel method for training RNNs to label unsegmented sequences directly, thereby solving both the sequence-learning and the post-processing problems.
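
PyTorch exposes this loss as nn.CTCLoss; the usage sketch below uses illustrative shapes (batch size, vocabulary, and lengths are arbitrary). The unsegmented targets are shorter than the network's output sequence, and CTC marginalizes over all alignments between the two.

# Minimal CTC usage: T output steps, N sequences, C classes (0 = blank).
import torch
import torch.nn as nn

torch.manual_seed(0)
T, N, C = 50, 4, 20

log_probs = torch.randn(T, N, C).log_softmax(2).detach().requires_grad_()
targets = torch.randint(1, C, (N, 10), dtype=torch.long)  # unsegmented labels
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), 10, dtype=torch.long)

ctc = nn.CTCLoss(blank=0)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()
print(f"CTC loss: {loss.item():.3f}")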

Unidirectional monosynaptic connections from auditory areas to the primary visual cortex in the marmoset monkey

The existence of a direct, nonreciprocal projection from auditory areas to V1 in a different primate species, which has evolved separately from the macaque for over 30 million years, confirms the existence of early-stage audiovisual integration in primate sensory processing.