Multimodal Neural Language Models


We introduce two multimodal neural language models: models of natural language that can be conditioned on other modalities. An image-text multimodal neural language model can be used to retrieve images given complex sentence queries, retrieve phrase descriptions given image queries, and generate text conditioned on images. We show that in the case of image-text modelling we can jointly learn word representations and image features by training our models together with a convolutional network. Unlike many existing methods, our approach can generate sentence descriptions for images without the use of templates, structured prediction, or syntactic trees. While we focus on image-text modelling, our algorithms can be easily applied to other modalities such as audio.
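To make the idea of conditioning a language model on image features concrete, the sketch below implements a toy log-bilinear next-word predictor whose context vector is biased by an image feature vector, in the spirit of the multimodal log-bilinear models described here. All sizes, weights, and names (`R`, `C`, `Cm`, `next_word_probs`) are illustrative assumptions, not the paper's actual settings or code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (assumed for illustration, not the paper's settings).
V, D, M = 10, 8, 6    # vocabulary size, word-embedding dim, image-feature dim
context = 3           # number of preceding words conditioned on

R = rng.normal(size=(V, D))                             # word representations
C = [rng.normal(size=(D, D)) for _ in range(context)]   # per-position context matrices
Cm = rng.normal(size=(D, M))                            # projects image features into word space
b = np.zeros(V)                                         # per-word bias

def next_word_probs(prev_words, image_feat):
    """Distribution over the next word given the previous `context`
    word indices and an image feature vector (e.g. from a convnet)."""
    # Predicted representation: context words plus an additive image term.
    r_hat = sum(C[i] @ R[w] for i, w in enumerate(prev_words)) + Cm @ image_feat
    scores = R @ r_hat + b
    e = np.exp(scores - scores.max())   # numerically stable softmax
    return e / e.sum()

p = next_word_probs([1, 4, 7], rng.normal(size=M))
```

Because the image term enters additively, the same machinery can score a sentence under different images, which is what enables retrieval in both directions as well as caption generation.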


Cite this paper

@inproceedings{Kiros2014MultimodalNL,
  title={Multimodal Neural Language Models},
  author={Ryan Kiros and Ruslan Salakhutdinov and Richard S. Zemel},
  booktitle={ICML},
  year={2014}
}