Corpus ID: 228064166

Edited Media Understanding: Reasoning About Implications of Manipulated Images

Jeff Da, Maxwell Forbes, Rowan Zellers, Anthony Zheng, Jena D. Hwang, Antoine Bosselut, Yejin Choi
Multimodal disinformation, from `deepfakes' to simple edits that deceive, is an important societal problem. Yet at the same time, the vast majority of media edits are harmless -- such as a filtered vacation photo. The difference between this example, and harmful edits that spread disinformation, is one of intent. Recognizing and describing this intent is a major challenge for today's AI systems. We present the task of Edited Media Understanding, requiring models to answer open-ended questions… 


On the Diversity and Limits of Human Explanations

Inspired by prior work in psychology and the cognitive sciences, existing human explanations in NLP are grouped into three categories: proximal mechanism, evidence, and procedure, which differ in nature and have implications for the resultant explanations.

Analyzing Commonsense Emergence in Few-shot Knowledge Models

The results show that commonsense knowledge models can rapidly adapt from limited examples, indicating that KG fine-tuning serves to learn an interface to encoded knowledge learned during pretraining.

Understanding Few-Shot Commonsense Knowledge Models

This work investigates training commonsense knowledge models in a few-shot setting with limited tuples per commonsense relation in the graph, and finds that human quality ratings for knowledge produced from a few-shot trained system can achieve performance within 6% of knowledge produced from fully supervised systems.

Teach Me to Explain: A Review of Datasets for Explainable Natural Language Processing

This review identifies 61 datasets with three predominant classes of textual explanations (highlights, free-text, and structured), organizes the literature on annotating each type, identifies strengths and shortcomings of existing collection methodologies, and gives recommendations for collecting EXNLP datasets in the future.

Expressing Visual Relationships via Language

This work introduces a new language-guided image editing dataset that contains a large number of real image pairs with corresponding editing instructions and proposes a new relational speaker model based on an encoder-decoder architecture with static relational attention and sequential multi-head attention and extended with dynamic relational attention.

From Recognition to Cognition: Visual Commonsense Reasoning

To move towards cognition-level understanding, a new reasoning engine is presented, Recognition to Cognition Networks (R2C), that models the necessary layered inferences for grounding, contextualization, and reasoning.

Learning to Globally Edit Images with Textual Description

This work shows how to globally edit images using textual instructions: given a source image and a textual instruction for the edit, generate a new image transformed under this instruction; it also shows that Graph RNN improves performance.

VQA: Visual Question Answering

We propose the task of free-form and open-ended Visual Question Answering (VQA). Given an image and a natural language question about the image, the task is to provide an accurate natural language answer.

Learning to Describe Differences Between Pairs of Similar Images

This paper collects a new dataset by crowd-sourcing difference descriptions for pairs of image frames extracted from video-surveillance footage, and proposes a model that captures visual salience by using a latent variable to align clusters of differing pixels with output sentences.

Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models

This paper presents Flickr30K Entities, which augments the 158k captions from Flickr30k with 244k coreference chains, linking mentions of the same entities across different captions for the same image, and associating them with 276k manually annotated bounding boxes.

Fusion of Detected Objects in Text for Visual Question Answering

A detailed ablation analysis shows that the early integration of the visual features into the text analysis is key to the effectiveness of the new architecture.

A Corpus for Reasoning about Natural Language Grounded in Photographs

This work introduces a new dataset for joint reasoning about natural language and images, with a focus on semantic diversity, compositionality, and visual reasoning challenges; evaluation using state-of-the-art visual reasoning methods shows the data presents a strong challenge.

UNITER: Learning UNiversal Image-TExt Representations

UNITER, a UNiversal Image-TExt Representation learned through large-scale pre-training over four image-text datasets, is introduced; it can power heterogeneous downstream V+L tasks with joint multimodal embeddings.

Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations

The Visual Genome dataset is presented, which contains over 108K images where each image has an average of 35 objects, 26 attributes, and 21 pairwise relationships between objects, and represents the densest and largest dataset of image descriptions, objects, attributes, relationships, and question answer pairs.