Corpus ID: 19085391

Show, Reward and Tell: Automatic Generation of Narrative Paragraph From Photo Stream by Adversarial Training

@inproceedings{Wang2018ShowRA,
  title={Show, Reward and Tell: Automatic Generation of Narrative Paragraph From Photo Stream by Adversarial Training},
  author={Jing Wang and Jianlong Fu and Jinhui Tang and Zechao Li and Tao Mei},
  booktitle={AAAI},
  year={2018}
}
Impressive image captioning results (i.e., generating an objective description of an image) have been achieved with plenty of training pairs. In this paper, we take one step further to investigate the creation of a narrative paragraph for a photo stream. This task is even more challenging due to the difficulty of modeling an ordered photo sequence and of generating a relevant paragraph with an expressive language style for storytelling. The difficulty can be further exacerbated by the limited training data, so that…
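The generator-critic loop the abstract describes can be illustrated compactly. The sketch below is a toy, self-contained rendition under loud assumptions: the generator is a per-step categorical distribution rather than the paper's hierarchical RNN, the critic is a bag-of-words logistic scorer, and the data are synthetic. It shows only the core mechanic: sample a story, score it with the critic, and feed the score back to the generator as a REINFORCE reward while the critic learns to separate human from generated stories.

```python
# Toy sketch of the adversarial training loop: a generator samples a "story",
# a critic scores how human-like it is, and the score is the REINFORCE reward.
# All sizes, models, and data here are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
VOCAB, SEQ_LEN, LR = 20, 5, 0.05

# Generator: one categorical distribution per step (stand-in for an RNN).
gen_logits = rng.normal(size=(SEQ_LEN, VOCAB))
# Critic: logistic scorer over a bag-of-words story representation.
critic_w = rng.normal(scale=0.1, size=VOCAB)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def bow(tokens):
    v = np.zeros(VOCAB)
    for tok in tokens:
        v[tok] += 1.0
    return v

def critic_score(tokens):
    """Critic's probability that the story is human-written."""
    return 1.0 / (1.0 + np.exp(-critic_w @ bow(tokens)))

# Synthetic "human" stories for the critic: humans favor low token ids.
human_stories = [list(rng.choice(5, size=SEQ_LEN)) for _ in range(32)]

for step in range(200):
    # Sample a story from the generator, keeping per-step probabilities.
    probs = [softmax(gen_logits[t]) for t in range(SEQ_LEN)]
    story = [int(rng.choice(VOCAB, p=p)) for p in probs]
    reward = critic_score(story)

    # Generator: REINFORCE ascent on log p(story), scaled by the critic's
    # reward (0.5 acts as a crude constant baseline).
    for t, tok in enumerate(story):
        grad = -probs[t]
        grad[tok] += 1.0                    # d log p(tok) / d logits
        gen_logits[t] += LR * (reward - 0.5) * grad

    # Critic: logistic regression, human stories = 1, generated = 0.
    human = human_stories[step % len(human_stories)]
    for sample, label in ((human, 1.0), (story, 0.0)):
        critic_w += LR * (label - critic_score(sample)) * bow(sample)
```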

Citations of this paper

Show, Reward, and Tell
TLDR
An attribute-based hierarchical generative model with reinforcement learning and adversarial training is proposed; it treats the story generator and the reward critic as adversaries, aiming to create paragraphs indistinguishable from human-level stories.
No Metrics Are Perfect: Adversarial Reward Learning for Visual Storytelling
TLDR
Though automatic evaluation indicates a slight performance boost over state-of-the-art (SOTA) methods in cloning expert behaviors, human evaluation shows that this approach achieves a significant improvement in generating more human-like stories than SOTA systems.
Adversarial Inference for Multi-Sentence Video Description
TLDR
This work proposes to apply adversarial techniques during inference, designing a discriminator which encourages better multi-sentence video description, and finds that a multi-discriminator "hybrid" design, where each discriminator targets one aspect of a description, leads to the best results.
Emotion Reinforced Visual Storytelling
TLDR
This paper introduces the concept of emotion into visual storytelling and proposes a model able to generate stories based not only on emotions produced by a novel emotion generator but also on customized emotions.
Hierarchically Structured Reinforcement Learning for Topically Coherent Visual Story Generation
TLDR
Empirical results from both automatic and human evaluations demonstrate that the proposed hierarchically structured reinforced training achieves significantly better performance compared to a strong flat deep reinforcement learning baseline.
MSCap: Multi-Style Image Captioning With Unpaired Stylized Text
TLDR
An adversarial learning network is proposed for the task of multi-style image captioning (MSCap), trained with a standard factual image caption dataset and a multi-stylized language corpus without paired images, enabling more natural and human-like captions.
Hide-and-Tell: Learning to Bridge Photo Streams for Visual Storytelling
TLDR
This paper proposes a hide-and-tell model designed to learn non-local relations across photo streams and to refine and improve conventional RNN-based models, and qualitatively shows the learned ability to interpolate a storyline over visual gaps.
Convolutional Auto-encoding of Sentence Topics for Image Paragraph Generation
TLDR
A new design, Convolutional Auto-Encoding (CAE), that employs a purely convolutional and deconvolutional auto-encoding framework for topic modeling on the region-level features of an image, and an architecture, CAE plus Long Short-Term Memory (dubbed CAE-LSTM), that integrates the learned topics in support of paragraph generation.
Diverse and Styled Image Captioning Using SVD-Based Mixture of Recurrent Experts
TLDR
The proposed captioning model can generate diverse and stylized image captions without the need for extra labeling, and produces better descriptions in terms of content accuracy.
Contextualise, Attend, Modulate and Tell: Visual Storytelling
TLDR
This work proposes a novel framework, Contextualize, Attend, Modulate and Tell (CAMT), that models the temporal relationships within the image sequence in both forward and backward directions, and evaluates the model on the Visual Storytelling Dataset using both automatic and human evaluation measures.

References

Showing 1-10 of 36 references
Let Your Photos Talk: Generating Narrative Paragraph for Photo Stream via Bidirectional Attention Recurrent Neural Networks
TLDR
This paper proposes a novel joint learning model that can attend on the discovered semantic relations to produce a sentence sequence and maintain its consistency with the photo stream, and that integrates the two-step learning components into a single optimization formulation, training the network in an end-to-end manner.
Video Paragraph Captioning Using Hierarchical Recurrent Neural Networks
TLDR
An approach that exploits hierarchical Recurrent Neural Networks to tackle the video captioning problem, i.e., generating one or multiple sentences to describe a realistic video, significantly outperforms the current state-of-the-art methods.
Expressing an Image Stream with a Sequence of Natural Sentences
TLDR
An approach for retrieving a sequence of natural sentences for an image stream that directly learns from a vast user-generated resource of blog posts as text-image parallel training data, and that outperforms other state-of-the-art candidate methods.
Show and tell: A neural image caption generator
TLDR
This paper presents a generative model based on a deep recurrent architecture that combines recent advances in computer vision and machine translation and that can be used to generate natural sentences describing an image.
A Hierarchical Approach for Generating Descriptive Image Paragraphs
TLDR
A model that decomposes both images and paragraphs into their constituent parts is developed, detecting semantic regions in images and using a hierarchical recurrent neural network to reason about language.
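The hierarchical decomposition this reference describes (a sentence-level RNN producing per-sentence topics, each decoded by a word-level RNN) is easy to see in a forward pass. The sketch below uses plain tanh RNN cells, random weights, and made-up dimensions; these are illustrative assumptions, not the paper's trained model.

```python
# Forward-pass sketch of a hierarchical paragraph generator: a sentence RNN
# emits one topic vector per sentence; a word RNN decodes each topic into
# words. Cells, shapes, and weights are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
IMG_DIM, HID, VOCAB = 8, 16, 30
N_SENTS, N_WORDS = 3, 5

def rnn_cell(W, x, h):
    """Plain tanh RNN step: h' = tanh(W @ [x; h])."""
    return np.tanh(W @ np.concatenate([x, h]))

W_sent = rng.normal(scale=0.1, size=(HID, IMG_DIM + HID))   # sentence RNN
W_word = rng.normal(scale=0.1, size=(HID, HID + HID))       # word RNN
W_out = rng.normal(scale=0.1, size=(VOCAB, HID))            # word logits

image_feats = rng.normal(size=IMG_DIM)   # pooled region-level image feature

paragraph, h_sent = [], np.zeros(HID)
for _ in range(N_SENTS):
    # Sentence RNN: one step per sentence yields that sentence's topic.
    h_sent = rnn_cell(W_sent, image_feats, h_sent)
    topic = h_sent

    # Word RNN: greedily decode the topic into a short sentence.
    words, h_word = [], np.zeros(HID)
    for _ in range(N_WORDS):
        h_word = rnn_cell(W_word, topic, h_word)
        words.append(int(np.argmax(W_out @ h_word)))
    paragraph.append(words)

print(paragraph)   # three "sentences" of five token ids each
```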
Self-Critical Sequence Training for Image Captioning
TLDR
This paper considers the problem of optimizing image captioning systems using reinforcement learning, and shows that by carefully optimizing systems using the test metrics of the MSCOCO task, significant gains in performance can be realized.
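Self-critical sequence training has a compact core: the reward of the model's own greedy decode serves as the REINFORCE baseline, so only samples that beat test-time inference get reinforced. The sketch below is a toy rendition; the overlap-based reward stands in (as an assumption) for a real test metric such as CIDEr, and the per-step categorical model stands in for a captioning RNN.

```python
# Toy sketch of the self-critical baseline: advantage = r(sampled) - r(greedy).
# The reward function and model here are illustrative stand-ins.
import numpy as np

rng = np.random.default_rng(0)
VOCAB, SEQ_LEN, LR = 10, 4, 0.1
logits = rng.normal(size=(SEQ_LEN, VOCAB))
reference = [1, 2, 3, 4]                      # toy ground-truth caption

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def reward(caption):
    """Toy stand-in for a test metric such as CIDEr: token overlap."""
    return sum(a == b for a, b in zip(caption, reference)) / SEQ_LEN

for step in range(300):
    probs = [softmax(logits[t]) for t in range(SEQ_LEN)]
    sampled = [int(rng.choice(VOCAB, p=p)) for p in probs]
    greedy = [int(np.argmax(logits[t])) for t in range(SEQ_LEN)]

    # Self-critical advantage: sampled reward minus greedy (baseline) reward.
    advantage = reward(sampled) - reward(greedy)

    # REINFORCE ascent on log p(sampled), scaled by the advantage.
    for t, tok in enumerate(sampled):
        grad = -probs[t]
        grad[tok] += 1.0
        logits[t] += LR * advantage * grad

print("greedy caption:", [int(np.argmax(logits[t])) for t in range(SEQ_LEN)])
```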
Mind's eye: A recurrent visual representation for image caption generation
TLDR
This paper explores the bi-directional mapping between images and their sentence-based descriptions with a recurrent neural network that attempts to dynamically build a visual representation of the scene as a caption is being generated or read.
Boosting Image Captioning with Attributes
TLDR
This paper presents Long Short-Term Memory with Attributes (LSTM-A), a novel architecture that integrates attributes into the successful Convolutional Neural Networks plus Recurrent Neural Networks (RNNs) image captioning framework by training them in an end-to-end manner.
Aligning Books and Movies: Towards Story-Like Visual Explanations by Watching Movies and Reading Books
TLDR
To align movies and books, a neural sentence embedding that is trained in an unsupervised way from a large corpus of books and a video-text neural embedding for computing similarities between movie clips and sentences in the book are proposed.
Framing Image Description as a Ranking Task: Data, Models and Evaluation Metrics (Extended Abstract)
TLDR
This work proposes to frame sentence-based image annotation as the task of ranking a given pool of captions, and introduces a new benchmark collection, consisting of 8,000 images that are each paired with five different captions which provide clear descriptions of the salient entities and events.