• Corpus ID: 245906203

CLIP-Event: Connecting Text and Images with Event Structures

  • Manling Li, Ruochen Xu, Shuohang Wang, Luowei Zhou, Xudong Lin, Chenguang Zhu, Michael Zeng, Heng Ji, Shih-Fu Chang
Vision-language (V+L) pretraining models have achieved great success in supporting multimedia applications by understanding the alignments between images and text. While existing vision-language pretraining models primarily focus on understanding objects in images or entities in text, they often ignore alignment at the level of events and their argument structures. In this work, we propose a contrastive learning framework that pushes vision-language pretraining models to comprehend events…
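The abstract above centers on event-level alignment. A minimal sketch of one common way to set this up (illustrative only; `swap_roles`, `verbalize`, and the template are hypothetical, not necessarily the paper's exact construction) is to build hard negative captions by manipulating event argument roles, so the contrastive objective must distinguish who did what to whom:

```python
def swap_roles(event):
    """Build a hard negative by exchanging the agent and patient roles."""
    neg = dict(event)
    neg["agent"], neg["patient"] = event["patient"], event["agent"]
    return neg

def verbalize(event):
    """Render a structured event as a caption via a simple template."""
    return f"{event['agent']} {event['verb']} {event['patient']}"

# Positive caption matches the image; the role-swapped negative describes
# the same entities and verb but a different event structure.
event = {"verb": "arrests", "agent": "a police officer", "patient": "a protester"}
positive = verbalize(event)              # matches the depicted event
negative = verbalize(swap_roles(event))  # same words, wrong argument structure
```

Because the positive and negative share vocabulary and differ only in structure, a model cannot solve the contrast with object-level cues alone.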
Translation between Molecules and Natural Language
This work presents MolT5 – a self-supervised learning framework for pretraining models on a vast amount of unlabeled natural language text and molecule strings, and considers several metrics, including a new cross-modal embedding-based metric, to evaluate the tasks of molecule captioning and text-based molecule generation.
Multimodal Few-Shot Object Detection with Meta-Learning Based Cross-Modal Prompting
It is shown that meta-learning and prompt-based learning, the most commonly used methods for few-shot learning and zero-shot transfer from pre-trained vision-language models to downstream tasks, are conceptually similar, and it is proposed to combine meta-learning with prompt-based learning for multimodal FSOD without tuning.
Multi-Modal Causal Inference with Deep Structural Equation Models
It is empirically demonstrated on tasks in genomics and healthcare that unstructured data can be used to correct for diverse sources of confounding, potentially enabling the use of large amounts of data that were previously not used in causal inference.
Leveraging Visual Knowledge in Language Tasks: An Empirical Study on Intermediate Pre-training for Cross-Modal Knowledge Transfer
The experiments show that visual knowledge transfer can improve performance in both low-resource and fully supervised settings, with cross-modal knowledge transferred using both images and captions together with vision-language training objectives.
RESIN-11: Schema-guided Event Prediction for 11 Newsworthy Scenarios
  • X. Du, Zixuan Zhang, Heng Ji
  • Computer Science
  • 2022
We introduce RESIN-11, a new schema-guided event extraction and prediction system that can be applied to a large variety of newsworthy scenarios. The framework consists of two parts: (1) an


Cross-media Structured Common Space for Multimedia Event Extraction
A novel method is proposed, Weakly Aligned Structured Embedding (WASE), that encodes structured representations of semantic information from textual and visual data into a common embedding space that enables exploiting available resources without explicit cross-media annotation.
Unified Vision-Language Pre-Training for Image Captioning and VQA
VLP is the first reported model that achieves state-of-the-art results on both vision-language generation and understanding tasks, as disparate as image captioning and visual question answering, across three challenging benchmark datasets: COCO Captions, Flickr30k Captions and VQA 2.0.
Probing Image-Language Transformers for Verb Understanding
This work designs a benchmark focused on verbs called SVO-Probes for examining subject, verb, object triplets and evaluates the recent family of multimodal image–language transformers to investigate if the good performance of these models is due to learned representations that successfully relate different aspects of language to images.
VinVL: Revisiting Visual Representations in Vision-Language Models
This paper develops an improved object detection model to provide object-centric representations of images, feeds the generated visual features into a Transformer-based VL fusion model, OSCAR, and uses an improved approach, OSCAR+, to pre-train the VL model and fine-tune it on a wide range of downstream VL tasks.
Situation Recognition: Visual Semantic Role Labeling for Image Understanding
This paper introduces situation recognition, the problem of producing a concise summary of the situation an image depicts including: (1) the main activity (e.g., clipping), (2) the participating
ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks
We present ViLBERT (short for Vision-and-Language BERT), a model for learning task-agnostic joint representations of image content and natural language. We extend the popular BERT architecture to a
Weakly Supervised Visual Semantic Parsing
A generalized formulation of SGG is proposed, namely Visual Semantic Parsing, which disentangles entity and predicate recognition and enables sub-quadratic performance, along with the first graph-based weakly supervised learning framework, built on a novel graph alignment algorithm, which enables training without bounding-box annotations.
A Thousand Frames in Just a Few Words: Lingual Description of Videos through Latent Topics and Sparse Object Stitching
This paper proposes a hybrid system consisting of a low level multimodal latent topic model for initial keyword annotation, a middle level of concept detectors and a high level module to produce final lingual descriptions that captures the most relevant contents of a video in a natural language description.
Learning Transferable Visual Models From Natural Language Supervision
It is demonstrated that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet.
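The pretraining task described above (predicting which caption goes with which image, i.e. CLIP's symmetric contrastive objective) can be sketched as follows; this is a minimal NumPy illustration over precomputed embeddings, not the actual CLIP implementation, and the temperature value is an assumption:

```python
import numpy as np

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of (image, text) pairs.

    Row i of img_emb and row i of txt_emb are a matched pair; every other
    pairing in the batch serves as a negative.
    """
    # L2-normalize so that dot products are cosine similarities
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)

    logits = img @ txt.T / temperature   # (N, N) similarity matrix
    labels = np.arange(len(logits))      # matched pairs lie on the diagonal

    def cross_entropy(lg, lb):
        # log-softmax over each row, then pick out the correct class
        log_probs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(lb)), lb].mean()

    # average the image-to-text and text-to-image directions
    return 0.5 * (cross_entropy(logits, labels) + cross_entropy(logits.T, labels))
```

With matched embeddings on the diagonal, the loss is low; shuffling one modality relative to the other raises it, which is exactly the signal the pretraining task exploits.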
FOIL it! Find One mismatch between Image and Language caption
It is demonstrated that merely utilising language cues is not enough to model FOIL-COCO and that it challenges the state-of-the-art by requiring a fine-grained understanding of the relation between text and image.