Cross-modal Retrieval with Correspondence Autoencoder


The problem of cross-modal retrieval, e.g., using a text query to search for images and vice-versa, is considered in this paper. A novel model involving correspondence autoencoder (Corr-AE) is proposed here for solving this problem. The model is constructed by correlating hidden representations of two uni-modal autoencoders. A novel optimal objective, which minimizes a linear combination of representation learning errors for each modality and correlation learning error between hidden representations of two modalities, is used to train the model as a whole. Minimization of correlation learning error forces the model to learn hidden representations with only common information in different modalities, while minimization of representation learning error makes hidden representations are good enough to reconstruct input of each modality. A parameter $\alpha$ is used to balance the representation learning error and the correlation learning error. Based on two different multi-modal autoencoders, Corr-AE is extended to other two correspondence models, here we called Corr-Cross-AE and Corr-Full-AE. The proposed models are evaluated on three publicly available data sets from real scenes. We demonstrate that the three correspondence autoencoders perform significantly better than three canonical correlation analysis based models and two popular multi-modal deep models on cross-modal retrieval tasks.

DOI: 10.1145/2647868.2654902

Extracted Key Phrases

12 Figures and Tables

Citations per Year

93 Citations

Semantic Scholar estimates that this publication has 93 citations based on the available data.

See our FAQ for additional information.

Cite this paper

@inproceedings{Feng2014CrossmodalRW, title={Cross-modal Retrieval with Correspondence Autoencoder}, author={Fangxiang Feng and Xiaojie Wang and Ruifan Li}, booktitle={ACM Multimedia}, year={2014} }