Multimodal foundation models are better simulators of the human brain

Haoyu Lu, Qiongyi Zhou, Nanyi Fei, Zhiwu Lu, Mingyu Ding, Jingyuan Wen, Changde Du, Xin Zhao, Haoran Sun, Huiguang He, J. Wen
Multimodal learning, especially large-scale multimodal pre-training, has developed rapidly over the past few years and led to the greatest advances in artificial intelligence (AI). Despite its effectiveness, understanding the underlying mechanism of multimodal pre-training models remains a grand challenge. Revealing the explainability of such models is likely to enable breakthroughs in novel learning paradigms in the AI field. To this end, given the multimodal nature of the human brain, we…

When Abstract Becomes Concrete: Naturalistic Encoding of Concepts in the Brain

Language is acquired and processed in complex and dynamic naturalistic contexts, involving simultaneous processing of connected speech, faces, bodies, objects, etc. How words and their associated…



Cortical response to naturalistic stimuli is largely predictable with deep neural networks

This work builds group-level models of neural activity that incorporate several inductive biases about neural information processing, including hierarchical processing, temporal assimilation, and auditory-visual interactions, and illustrates that encoding models learn high-level concepts that generalize to task-bound paradigms.

What can 5.17 billion regression fits tell us about artificial models of the human visual system?

A large-scale benchmarking analysis of 72 modern deep neural network models is performed to characterize with robust statistical power how differences in architecture and training task contribute to the prediction of human fMRI activity across 16 distinct regions of the human visual system.

Towards artificial general intelligence via a multimodal foundation model

This work develops a foundation model pre-trained with huge multimodal data, which can be quickly adapted for various downstream cognitive tasks, and demonstrates that the foundation model now possesses strong imagination ability.

Unsupervised neural network models of the ventral visual stream

Neural network models learned with deep unsupervised contrastive embedding methods achieve neural prediction accuracy in multiple ventral visual cortical areas that equals or exceeds that of models derived using today's best supervised methods, and the mapping of these models' hidden layers onto the brain is neuroanatomically consistent across the ventral stream.
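The contrastive embedding objective behind such unsupervised models can be illustrated with a short sketch. This is an assumption-laden toy example, not code from the paper: it implements a generic InfoNCE-style loss in which two augmented views of the same stimulus are pulled together in embedding space while views of different stimuli are pushed apart.

```python
import numpy as np

rng = np.random.default_rng(0)

def info_nce_loss(z1, z2, temperature=0.1):
    """Generic InfoNCE loss: z1[i] and z2[i] are embeddings of two
    views of the same stimulus (the positive pair); all other rows of
    z2 serve as negatives for z1[i]."""
    # L2-normalize so the dot product is cosine similarity.
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / temperature  # (n, n) similarity matrix
    # Log-softmax over each row; row i's positive is column i.
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

# Toy data: two "augmented views" = same base embedding plus small noise.
n, d = 8, 16
base = rng.normal(size=(n, d))
z1 = base + 0.01 * rng.normal(size=(n, d))
z2 = base + 0.01 * rng.normal(size=(n, d))

aligned = info_nce_loss(z1, z2)               # correctly paired views
shuffled = info_nce_loss(z1, np.roll(z2, 1, axis=0))  # mismatched pairs
```

Matched views yield a much lower loss than mismatched ones, which is exactly the gradient signal that shapes the learned representation.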

Using goal-driven deep learning models to understand sensory cortex

This work outlines how the goal-driven HCNN (hierarchical convolutional neural network) approach can be used to probe the development and organization of sensory cortical processing even more deeply.

Limits to visual representational correspondence between convolutional neural networks and the human brain

This work shows that CNNs do not fully capture the higher-level visual representations of real-world objects, nor those of artificial objects at either lower or higher levels of representation, indicating fundamental differences in how the brain and CNNs represent visual information.

Multisensory integration: methodological approaches and emerging principles in the human brain

Incorporating Context into Language Encoding Models for fMRI

The models built here show a significant improvement in encoding performance relative to state-of-the-art embeddings in nearly every brain area, suggesting that LSTM language models learn high-level representations related to those in the human brain.