Models of Visually Grounded Speech Signal Pay Attention to Nouns: A Bilingual Experiment on English and Japanese

@inproceedings{Havard2019ModelsOV,
  title={Models of Visually Grounded Speech Signal Pay Attention to Nouns: A Bilingual Experiment on English and Japanese},
  author={William N. Havard and Jean-Pierre Chevrot and Laurent Besacier},
  booktitle={ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year={2019},
  pages={8618-8622}
}
We investigate the behaviour of attention in neural models of visually grounded speech trained on two languages: English and Japanese. Experimental results show that attention focuses on nouns, and that this behaviour holds for two typologically very different languages. We also draw parallels between artificial neural attention and human attention and show that neural attention focuses on word endings, as has been theorised for human attention. Finally, we investigate how two visually…
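
The models analysed here encode the speech signal with a recurrent network and pool its hidden states through an attention layer; the per-frame attention weights are what get aligned with word boundaries to see which words (and which parts of words) the model focuses on. Below is a minimal PyTorch sketch of such attention pooling, not the authors' implementation: the module names, the GRU encoder, and the dimensions (13 MFCC features, 256 hidden units) are illustrative assumptions.

# Minimal sketch (not the authors' code) of attention pooling over a
# recurrent speech encoder. The returned weights `alpha` can be aligned
# with word boundaries to measure how much attention each word receives.
import torch
import torch.nn as nn

class AttentivePooling(nn.Module):
    """Scores each encoder time step and returns their weighted sum."""
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, states: torch.Tensor):
        # states: (batch, time, hidden)
        logits = self.score(states).squeeze(-1)    # (batch, time)
        alpha = torch.softmax(logits, dim=-1)      # per-frame attention weights
        pooled = torch.bmm(alpha.unsqueeze(1), states).squeeze(1)
        return pooled, alpha                       # keep alpha for analysis

class SpeechEncoder(nn.Module):
    """GRU over acoustic features (e.g. MFCCs), then attention pooling."""
    def __init__(self, feat_dim: int = 13, hidden_dim: int = 256):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden_dim, batch_first=True)
        self.pool = AttentivePooling(hidden_dim)

    def forward(self, feats: torch.Tensor):
        states, _ = self.rnn(feats)
        return self.pool(states)

# Usage: two utterances of 120 frames each.
encoder = SpeechEncoder()
feats = torch.randn(2, 120, 13)
embedding, alpha = encoder(feats)
print(embedding.shape, alpha.shape)  # torch.Size([2, 256]) torch.Size([2, 120])

Summing `alpha` over the frames belonging to each word (given a forced alignment) yields the per-word attention mass used in analyses of this kind, e.g. to test whether nouns or word endings attract the most weight.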

Citations

Word Recognition, Competition, and Activation in a Model of Visually Grounded Speech
Trilingual Semantic Embeddings of Visually Grounded Speech with Self-Attention Mechanisms
Attention-Based Keyword Localisation in Speech using Visual Grounding
Learning Hierarchical Discrete Linguistic Units from Visually-Grounded Speech
Text-Free Image-to-Speech Synthesis Using Learned Segmental Units
Speech-Image Semantic Alignment Does Not Depend on Any Prior Classification Tasks
