Corpus ID: 221140182

Advancing weakly supervised cross-domain alignment with optimal transport

Siyang Yuan, Ke Bai, Liqun Chen, Yizhe Zhang, Chenyang Tao, Chunyuan Li, Guoyin Wang, Ricardo Henao, Lawrence Carin
Cross-domain alignment between image objects and text sequences is key to many visual-language tasks, and it poses a fundamental challenge to both computer vision and natural language processing. This paper investigates a novel approach for the identification and optimization of fine-grained semantic similarities between image and text entities, under a weakly-supervised setup, improving performance over state-of-the-art solutions. Our method builds upon recent advances in optimal transport (OT…
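The full text is not shown here, but OT-based alignment of this kind is typically computed with entropy-regularized Sinkhorn iterations over a cost matrix between region and word embeddings. The sketch below is an illustrative implementation of that general technique, not the authors' code; all names (`sinkhorn`, `regions`, `words`) are hypothetical.

```python
import numpy as np

def sinkhorn(cost, a, b, eps=0.1, n_iter=200):
    """Entropy-regularized optimal transport (Sinkhorn-Knopp).

    cost: (m, n) cost matrix; a: (m,) and b: (n,) marginals summing to 1.
    Returns the (m, n) transport plan T with row sums ~ a, column sums ~ b.
    """
    K = np.exp(-cost / eps)                 # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iter):
        v = b / (K.T @ u)                   # scale columns toward b
        u = a / (K @ v)                     # scale rows toward a
    return u[:, None] * K * v[None, :]

# Toy example: align 3 "image regions" with 4 "words" via cosine distance.
rng = np.random.default_rng(0)
regions = rng.normal(size=(3, 8))
words = rng.normal(size=(4, 8))
regions /= np.linalg.norm(regions, axis=1, keepdims=True)
words /= np.linalg.norm(words, axis=1, keepdims=True)

cost = 1.0 - regions @ words.T              # cosine distance as OT cost
T = sinkhorn(cost, np.full(3, 1 / 3), np.full(4, 1 / 4))
ot_distance = (T * cost).sum()              # alignment score (lower = closer)
```

The transport plan `T` gives soft, fine-grained correspondences between entities in the two domains, which is what makes OT attractive for weakly supervised alignment: only image-sentence pairs are needed, yet region-word matches emerge from the plan.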
2 Citations
G2DA: Geometry-Guided Dual-Alignment Learning for RGB-Infrared Person Re-Identification
A graph-enabled distribution matching solution, dubbed Geometry-Guided Dual-Alignment (G2DA) learning, for RGB-IR ReID, together with a Message Fusion Attention (MFA) mechanism that adaptively reweights the information flow of semantic propagation, effectively strengthening the discriminability of the extracted semantic features.
GDA: Geometry-Guided Dual-Alignment Learning for RGB-Infrared Person Re-Identification
A Geometry-Guided Dual-Alignment learning framework (GDA) is presented, which jointly enhances modality invariance and reinforces discriminability with human topological structure in features, boosting the overall matching performance of RGB-IR ReID solutions.
References
Align2Ground: Weakly Supervised Phrase Grounding Guided by Image-Caption Alignment
This work proposes a novel end-to-end model that uses caption-to-image retrieval as a downstream task to guide phrase localization: it infers the latent correspondences between regions-of-interest and phrases in the caption and builds a discriminative image representation from the matched RoIs.
Finding Beans in Burgers: Deep Semantic-Visual Embedding with Localization
This work introduces a new architecture of this type, with a visual path that leverages recent space-aware pooling mechanisms and a textual path trained jointly from scratch, yielding a versatile model.
Improving Sequence-to-Sequence Learning via Optimal Transport
This work imposes global sequence-level guidance via new supervision based on optimal transport, enabling the overall characterization and preservation of semantic features in sequence-to-sequence models, and shows consistent improvements across a wide variety of NLP tasks.
Stacked Cross Attention for Image-Text Matching
Stacked Cross Attention discovers the full latent alignments using both image regions and words in a sentence as context to infer image-text similarity, achieving state-of-the-art results on the MS-COCO and Flickr30K datasets.
Dual-path Convolutional Image-Text Embeddings with Instance Loss
An end-to-end dual-path convolutional network learns image and text representations under the unsupervised assumption that each image/text group can be viewed as a class, allowing the system to learn directly from the data and fully utilize the supervision.
Knowledge Aided Consistency for Weakly Supervised Phrase Grounding
A novel Knowledge Aided Consistency Network (KAC Net) is proposed, optimized by reconstructing the input query and proposal information, with a Knowledge Based Pooling (KBP) gate introduced to focus on query-related proposals.
Visual Semantic Reasoning for Image-Text Matching
A simple and interpretable reasoning model generates visual representations that capture the key objects and semantic concepts of a scene, outperforming the current best methods for image retrieval and caption retrieval on the MS-COCO and Flickr30K datasets.
Linking Image and Text with 2-Way Nets
A novel bi-directional neural network architecture for matching vectors from two data sources, enabling the use of a Euclidean loss for correlation maximization and showing state-of-the-art results on a number of computer vision matching tasks.
Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training
After pre-training on large-scale image-caption pairs, Unicoder-VL transfers to caption-based image-text retrieval and visual commonsense reasoning with just one additional output layer, demonstrating the power of cross-modal pre-training.
Weakly Supervised Phrase Localization with Multi-scale Anchored Transformer Network
A novel weakly supervised model, the Multi-scale Anchored Transformer Network (MATN), accurately localizes free-form textual phrases with only image-level supervision, significantly outperforming state-of-the-art methods.