Learning Shared Semantic Space with Correlation Alignment for Cross-Modal Event Retrieval

Zhenguo Yang, Zehang Lin, Peipei Kang, Jianming Lv, Qing Li, Wenyin Liu.
ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), pp. 1–22.
In this article, we propose to learn a shared semantic space with correlation alignment (S3CA) for multimodal data representations, which aligns nonlinear correlations of multimodal data distributions in deep neural networks designed for heterogeneous data. In the context of cross-modal (event) retrieval, we design a neural network with convolutional layers and fully connected layers to extract features for images, including images on Flickr-like social media. Simultaneously, we exploit a fully…
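The correlation-alignment idea in the abstract above can be sketched as a CORAL-style loss: the squared Frobenius distance between the feature covariance matrices of the two modalities' embeddings. A minimal NumPy sketch, where the function name, shapes, and normalization constant are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def coral_loss(x_img, x_txt):
    """CORAL-style correlation alignment: squared Frobenius distance
    between the covariance matrices of two modalities' feature batches.
    x_img, x_txt: arrays of shape (batch, d) from each modality branch."""
    def covariance(x):
        xm = x - x.mean(axis=0, keepdims=True)
        return xm.T @ xm / (x.shape[0] - 1)
    d = x_img.shape[1]
    diff = covariance(x_img) - covariance(x_txt)
    # Conventional 1/(4*d^2) scaling keeps the loss magnitude stable in d.
    return np.sum(diff ** 2) / (4 * d * d)
```

When the two batches share second-order statistics the loss is zero; it grows as their correlation structures diverge, which is what drives the alignment during training.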
MMED: A Multi-domain and Multi-modality Event Dataset
A multi-domain and multi-modality event dataset is constructed and released, containing 25,165 textual news articles collected from hundreds of news media sites and 76,516 image posts shared on the Flickr social media platform, annotated according to 412 real-world events.
Deep Semantic Space with Intra-class Low-rank Constraint for Cross-modal Retrieval
In this paper, a novel Deep Semantic Space learning model with Intra-class Low-rank constraint (DSSIL) is proposed for cross-modal retrieval, which is composed of two subnetworks…
Intra-class low-rank regularization for supervised and semi-supervised cross-modal retrieval
Two deep models based on intra-class low-rank regularization, ILCMR and Semi-ILCMR, are proposed for supervised and semi-supervised cross-modal retrieval, respectively; experiments demonstrate their superiority over other state-of-the-art methods.
Comparative analysis on cross-modal information retrieval: A review
A comparative analysis of several cross-modal representations and the results of state-of-the-art methods on benchmark datasets is presented, along with open issues, to give researchers a better understanding of the current landscape and to identify future research directions.
Learning discriminative hashing codes for cross-modal retrieval based on multi-view features
A discrete hashing learning framework that jointly performs classifier learning and subspace learning is proposed to complete multiple search tasks simultaneously; experiments indicate the superiority of the method over state-of-the-art methods.
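As a rough illustration of the retrieval step shared by such hashing methods, items are ranked by Hamming distance between binary codes rather than by floating-point similarity. A minimal sketch, assuming codes are 0/1 NumPy arrays; this is not the paper's framework:

```python
import numpy as np

def hamming_rank(query_code, codes):
    """Rank gallery items by Hamming distance to a query hash code.
    query_code: shape (bits,); codes: shape (n_items, bits); both 0/1."""
    dists = np.count_nonzero(codes != query_code, axis=1)
    return np.argsort(dists)  # indices, nearest first
```

Because each comparison is a bitwise count, retrieval scales well to large galleries, which is the main appeal of hashing-based cross-modal search.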
Cross-Modal and Multimodal Data Analysis Based on Functional Mapping of Spectral Descriptors and Manifold Regularization
This work proposes a manifold regularization framework based on the functional mapping between SGWS descriptors (FMBSD) for finding the pointwise correspondences of multimodal heterogeneous modalities; experimental results indicate its effectiveness and superiority over state-of-the-art methods.
Fast End-to-End Speech Recognition Via Non-Autoregressive Models and Cross-Modal Knowledge Transferring From BERT
This work proposes an end-to-end non-autoregressive speech recognition model called LASO (Listen Attentively, and Spell Once), which can capture the token relations by self-attention on the aggregated hidden representations from the whole speech signal rather than autoregressive modeling on tokens.
Repurpose Image Identification for Fake News Detection
  • Steven Jia He Lee, Tangqing Li, Wynne Hsu, M. Lee
  • Computer Science
  • DEXA
  • 2021


Cross-Modal Event Retrieval: A Dataset and a Baseline Using Deep Semantic Learning
The DSS maps data into a high-level semantic space in which the distance between data samples can be measured straightforwardly for cross-modal event retrieval, and it outperforms the state-of-the-art approaches on both the Pascal Sentence dataset and the Wiki-Flickr event dataset.
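Measuring distance "straightforwardly" in a shared semantic space usually means nearest-neighbour search under cosine similarity: embed the query from one modality, then rank the other modality's embeddings. A minimal, hypothetical sketch (not the DSS implementation):

```python
import numpy as np

def retrieve(query_vec, gallery, k=3):
    """Return indices of the top-k gallery items (the other modality)
    by cosine similarity to the query in the shared semantic space.
    query_vec: shape (d,); gallery: shape (n_items, d)."""
    q = query_vec / np.linalg.norm(query_vec)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    sims = g @ q                      # cosine similarities
    return np.argsort(-sims)[:k]      # highest similarity first
```

Since both modalities live in the same space after training, the same function serves image-to-text and text-to-image retrieval.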
Cross-media Retrieval by Learning Rich Semantic Embeddings of Multimedia
A brain-inspired cross-media retrieval framework to learn rich semantic embeddings of multimedia, which combines the visual and descriptive senses for an image from the view of human perception via a joint model, called multi-sensory fusion network (MSFN).
CCL: Cross-modal Correlation Learning With Multigrained Fusion by Hierarchical Network
This paper proposes a cross-modal correlation learning (CCL) approach with multigrained fusion by hierarchical network; compared with 13 state-of-the-art methods on 6 widely used cross-modal datasets, the experimental results show that the CCL approach achieves the best performance.
Shared Multi-View Data Representation for Multi-Domain Event Detection
This paper presents an event detection framework to discover real-world events from multiple data domains, including online news media and social media, and proposes class-wise residual models designed to discover the events underlying the data based on the reconstruction residuals.
Cross-Modal Retrieval via Deep and Bidirectional Representation Learning
A deep and bidirectional representation learning model is proposed to address the issue of image-text cross-modal retrieval and shows that the proposed architecture is effective and the learned representations have good semantics to achieve superior cross-modal retrieval performance.
Deep Cross-Modal Correlation Learning for Audio and Lyrics in Music Retrieval
This is the first study that uses deep architectures for learning the temporal correlation between audio and lyrics, involving two-branch deep neural networks for the audio modality and the text modality (lyrics); two significant contributions are made in the audio branch.
Cross-Media Shared Representation by Hierarchical Learning with Multiple Deep Networks
The cross-media multiple deep network (CMDN) is proposed to exploit the complex cross-media correlation by hierarchical learning, and it achieves better performance compared with several state-of-the-art methods on 3 extensively used cross-media datasets.
Aggregating Image and Text Quantized Correlated Components
A new representation method is put forward that aggregates information provided by the projections of both modalities on their aligned subspaces, and a method relying on neighborhoods in these subspaces is suggested to complete uni-modal information.
Multi-Networks Joint Learning for Large-Scale Cross-Modal Retrieval
A novel deep framework of multi-networks joint learning for large-scale cross-modal retrieval is proposed, which can simultaneously learn specific features adapted to the cross-modal task and a shared latent space for images and sentences.
A new approach to cross-modal multimedia retrieval
It is shown that accounting for cross-modal correlations and semantic abstraction both improves retrieval accuracy; the proposed approach is also shown to outperform state-of-the-art image retrieval systems on a unimodal retrieval task.