PhotoChat: A Human-Human Dialogue Dataset With Photo Sharing Behavior For Joint Image-Text Modeling

  Xiaoxue Zang, Lijuan Liu, Maria Wang, Yang Song, Hao Zhang, Jindong Chen
We present a new human-human dialogue dataset, PhotoChat, the first dataset that casts light on photo-sharing behavior in online messaging. PhotoChat contains 12k dialogues, each paired with a user photo shared during the conversation. Based on this dataset, we propose two tasks to facilitate research on image-text modeling: a photo-sharing intent prediction task that predicts whether one intends to share a photo in the next conversation turn, and a photo retrieval task…

Multimodal Dialogue Response Generation

Divter, a novel conversational agent, is devised to isolate the parameters that depend on multimodal dialogues from the rest of the generation model; it achieves state-of-the-art results in both automatic and human evaluation and can generate informative text and high-resolution image responses.

Towards Building an Open-Domain Dialogue System Incorporated with Internet Memes

Experimental results on the MOD dataset demonstrate that the solutions presented can effectively incorporate Internet memes into dialogue systems; an auxiliary task of emotion description prediction (EDP) is also introduced to boost the performance of meme emotion classification.

Resolving the Human Subjects Status of Machine Learning's Crowdworkers

This analysis exposes a potential loophole in the Common Rule, where researchers can elude research ethics oversight by splitting data collection and analysis into distinct studies.

Multimodal Learning with Transformers: A Survey

A comprehensive survey of Transformer techniques oriented toward multimodal data is presented, along with a discussion of open problems and potential research directions for the community.

Image-Grounded Conversations: Multimodal Context for Natural Question and Response Generation

This work presents a novel task, Image Grounded Conversations (IGC), in which natural-sounding conversations are generated about a shared image, and introduces a new multiple reference dataset of crowd-sourced, event-centric conversations on images.

Image-Chat: Engaging Grounded Conversations

Automatic metrics and human evaluations of engagingness show the efficacy of this approach: state-of-the-art performance is obtained on the existing IGC task, and the best-performing model is almost on par with humans on the Image-Chat test set.

Stacked Cross Attention for Image-Text Matching

Stacked Cross Attention discovers the full latent alignments, using both image regions and words in a sentence as context, to infer image-text similarity; it achieves state-of-the-art results on the MS-COCO and Flickr30K datasets.
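The word-region attention and similarity pooling described above can be sketched in NumPy; this is an illustrative rendition of the text-to-image attention direction only, with the function names, temperature value, and pooling choice picked for exposition rather than taken from the paper.

```python
import numpy as np

def cosine_sim(a, b):
    """Pairwise cosine similarity between rows of a (n, d) and b (m, d)."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

def stacked_cross_attention_score(words, regions, temperature=9.0):
    """Sketch of a text-to-image stacked cross attention score:
    each word attends over image regions, and the per-word
    similarities are averaged into a single image-text score.
    words: (n, d) word features; regions: (k, d) region features."""
    sim = np.clip(cosine_sim(words, regions), 0.0, None)  # (n, k); keep positive evidence
    attn = np.exp(temperature * sim)
    attn /= attn.sum(axis=1, keepdims=True)               # attention over regions, per word
    attended = attn @ regions                             # (n, d) attended image vector per word
    word_scores = np.sum(
        words / np.linalg.norm(words, axis=1, keepdims=True)
        * (attended / np.linalg.norm(attended, axis=1, keepdims=True)),
        axis=1,
    )
    return float(word_scores.mean())                      # pool word scores into one number
```

With matching word and region features the attended vector for each word collapses onto its own region, so the score approaches 1; mismatched features score lower.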

VQA: Visual Question Answering

We propose the task of free-form and open-ended Visual Question Answering (VQA). Given an image and a natural language question about the image, the task is to provide an accurate natural language answer.

Deep Visual-Semantic Alignments for Generating Image Descriptions

  • A. Karpathy, L. Fei-Fei
  • IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017
A model is presented that generates natural language descriptions of images and their regions, based on a novel combination of Convolutional Neural Networks over image regions, bidirectional Recurrent Neural Networks over sentences, and a structured objective that aligns the two modalities through a multimodal embedding.

Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models

This work introduces the structure-content neural language model, which disentangles the structure of a sentence from its content, conditioned on representations produced by the encoder, and shows that with linear encoders the learned embedding space captures multimodal regularities in terms of vector-space arithmetic.
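The vector-space arithmetic mentioned above (an analogy of the form a : b :: c : ?) can be illustrated with a toy nearest-neighbor lookup in a shared embedding space; the function name and the toy vectors below are hypothetical, not taken from the paper.

```python
import numpy as np

def analogy(emb, a, b, c):
    """Solve a : b :: c : ? by nearest neighbor to emb[b] - emb[a] + emb[c].
    emb maps tokens (which could denote words or images in a joint
    multimodal space) to vectors; all names here are illustrative."""
    query = emb[b] - emb[a] + emb[c]
    best, best_sim = None, -np.inf
    for key, vec in emb.items():
        if key in (a, b, c):
            continue  # only rank candidates other than the query terms
        sim = query @ vec / (np.linalg.norm(query) * np.linalg.norm(vec))
        if sim > best_sim:
            best, best_sim = key, sim
    return best
```

For instance, with toy vectors where "blue car" = "blue" + "car", the query "blue" : "red" :: "blue car" : ? resolves to "red car".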

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

This systematic study compares pre-training objectives, architectures, unlabeled datasets, transfer approaches, and other factors across dozens of language understanding tasks, achieving state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more.

UNITER: UNiversal Image-TExt Representation Learning

UNITER, a UNiversal Image-TExt Representation learned through large-scale pre-training over four image-text datasets, is introduced; it can power heterogeneous downstream V+L tasks with joint multimodal embeddings.

DeViSE: A Deep Visual-Semantic Embedding Model

This paper presents a new deep visual-semantic embedding model trained to identify visual objects using both labeled image data and semantic information gleaned from unannotated text, and shows that the semantic information can be exploited to make predictions about tens of thousands of image labels not observed during training.
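The training objective behind such a visual-semantic embedding is typically a hinge rank loss that scores the image against every label's text vector; the sketch below is a simplified, unweighted version, with names and the margin value chosen for illustration rather than matching the paper's exact formulation.

```python
import numpy as np

def hinge_rank_loss(img_emb, label_embs, true_idx, margin=0.1):
    """Simplified DeViSE-style hinge rank loss: require the image
    embedding's similarity to the correct label's text vector to
    exceed its similarity to every wrong label by at least `margin`.
    img_emb: (d,) visual embedding; label_embs: (m, d) text vectors."""
    scores = label_embs @ img_emb                 # similarity to each label
    losses = np.maximum(0.0, margin - scores[true_idx] + scores)
    losses[true_idx] = 0.0                        # no penalty against the correct label
    return float(losses.sum())
```

When the correct label already outscores every wrong label by the margin, the loss is zero; otherwise each violating label contributes its margin shortfall.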

Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering

A combined bottom-up and top-down attention mechanism that enables attention to be calculated at the level of objects and other salient image regions is proposed, demonstrating the broad applicability of this approach to VQA.